Consensus is a decision problem in which n processors, each starting with a
value not known to the others, must collectively agree on a single value. If the
initial values are equal, the processors must agree on that common value; this
is the validity condition. A consensus protocol is wait-free if every processor
finishes in a finite number of its own steps regardless of the relative speeds
of the other processors, a condition that precludes the use of traditional
synchronization techniques such as critical sections, locking, or leader election.
Wait-free consensus is fundamental to synchronization without mutual exclusion,
as it can be used to construct wait-free implementations of arbitrary
concurrent data structures. It is known that no deterministic algorithm for
wait-free consensus is possible, although many randomized algorithms have
been proposed.


I  present  two  algorithms  for  solving  the  wait-free  consensus  problem  in  the 
standard  asynchronous  shared-memory  model.  The  first  is  a  very  simple 
protocol  based  on  a  random  walk.  The  second  is  a  protocol  based  on  weighted 
voting, in which each processor executes $O(n \log^2 n)$ expected operations.
This bound is close to the trivial lower bound of $\Omega(n)$, and it substantially
improves on the best previously known bound of $O(n^2 \log n)$, due to Bracha
and Rachman.

Chapter 1
Introduction 


Consensus  [CIL87]  is  a  tool  for  allowing  a  group  of  processors  to  collectively 
choose  one  value  from  a  set  of  alternatives.  It  is  defined  as  a  decision  problem 
in  which  n  processors,  each  starting  with  a  value  (0  or  1)  not  known  to 
the  others,  must  collectively  agree  on  a  single  value.  (The  restriction  to  a 
single  bit  does  not  prevent  the  processors  from  choosing  between  more  than 
two possibilities since they can just run a one-bit consensus protocol
multiple  times.)  The  processors  communicate  by  reading  from  and  writing 
to  a  collection  of  registers;  each  processor  finishes  the  protocol  by  deciding 
on  a  value  and  halting.  A  consensus  protocol  is  wait-free  if  each  processor 
makes  its  decision  after  a  finite  number  of  its  own  steps,  regardless  of  the 
relative  speeds  or  halting  failures  of  the  other  processors.  In  addition,  a 
consensus  protocol  must  satisfy  the  validity  condition:  if  every  processor 
starts  with  the  same  input  value,  every  processor  decides  on  that  value.  This 
condition  excludes  trivial  protocols  such  as  one  where  every  processor  always 
decides  0. 
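
The two correctness conditions can be stated as a check on the outcome of a single run. The following is a minimal sketch, not part of any protocol in this dissertation; the function name `check_consensus` and the dictionary representation of inputs and decisions are assumptions made for illustration. A processor that the scheduler never runs to completion is represented by a decision of `None`.

```python
from typing import Dict, Optional


def check_consensus(inputs: Dict[str, int],
                    decisions: Dict[str, Optional[int]]) -> bool:
    """Return True if the processors that decided satisfy both conditions.

    Hypothetical representation: inputs[p] is the input bit of processor p,
    and decisions[p] is its decision, or None if p has not decided.
    """
    decided = [v for v in decisions.values() if v is not None]

    # Consistency: no two processors decide on different values.
    if len(set(decided)) > 1:
        return False

    # Validity: if every input is the same, that value is the only legal decision.
    input_values = set(inputs.values())
    if len(input_values) == 1 and decided:
        return decided[0] in input_values

    return True
```

For example, the trivial protocol in which every processor always decides 0 passes the consistency check but fails validity when every input is 1.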

The  asynchronous  shared-memory  model  is  an  attempt  to  capture  the 
effect  of  making  the  weakest  possible  assumptions  about  the  timing  of  events 
in  a  distributed  system.  At  each  moment  an  adversary  scheduler  chooses  one 
of  the  n  processors  to  run.  No  guarantees  are  made  about  the  scheduler's 
choices —  it  may  start  and  stop  processors  at  will,  based  on  a  total  knowledge 
of the state of the system, including the contents of the registers, the
programming of the processors, and even the internal states of the processors.
Since the scheduler can always simulate a halting failure by choosing not to
run a processor, the model effectively allows up to n - 1 halting failures. The
adversary's power, however, is not unlimited. It cannot cause the processors
to  deviate  from  their  programming  or  cause  operations  on  the  registers  to 
fail  or  return  incorrect  values. 

Combined with the requirement that a consensus protocol terminate after
a finite number of operations, the adversary's power precludes the use of
traditional  synchronization  techniques  such  as  critical  sections,  locking,  or 
leader  election:  any  processor  that  obtains  a  critical  resource  can  be  killed, 
and  as  soon  as  any  processor  or  group  of  processors  is  given  control  over  the 
outcome  of  the  protocol,  the  scheduler  can  put  them  to  sleep  and  leave  the 
other  processors  helpless  to  complete  the  protocol  on  their  own.  In  general, 
any protocol depending on mutual exclusion, where one processor's possession
of a resource or role depends on other processors being excluded from
it, will not be wait-free.

Wait-free  consensus  is  fundamental  to  synchronization  without  mutual 
exclusion and thus lies at the heart of the more general problem of
constructing highly concurrent data structures [Her91]. It can be used to obtain
wait-free implementations of arbitrary abstract data types with atomic
operations [Her91, Plo89]. It is also complete for distributed decision tasks
[CM89] in the sense that it can be used to solve all such decision tasks that
have a wait-free solution. Intuitively, the processors can individually simulate
a  sequence  of  operations  on  an  object  or  the  computation  of  a  decision  value, 
and  use  consensus  protocols  to  choose  among  the  possibly  distinct  outcomes. 
Conversely,  if  consensus  is  not  possible,  it  is  also  impossible  to  construct 
wait-free  implementations  for  many  simple  abstract  data  types,  including 
queues, test-and-set bits, or compare-and-swap registers, as there exist
simple deterministic consensus protocols (for bounded numbers of processors)
using  these  primitives  [Her91]. 

Alas,  given  the  powerful  adversary  of  the  asynchronous  shared-memory 
model  it  is  not  possible  to  have  a  deterministic  wait-free  consensus  protocol, 
one  in  which  the  behavior  of  the  processors  is  predictable  in  advance.  In 
fact,  in  a  wide  variety  of  asynchronous  models  it  has  been  shown  there  is 
no  deterministic  consensus  protocol  that  works  against  a  scheduler  that  can 
stop even a single processor [CIL87, DDS87, FLP85, Her91, LAA87, TM89].
Though  this  result  is  usually  proved  using  more  general  techniques,  when 
only  single-writer  registers  are  used  it  has  a  simple  proof  that  illustrates 
many  of  the  problems  that  arise  when  trying  to  solve  wait-free  consensus. 

Imagine that two processors A and B are trying to solve the consensus
problem. Their situation is very much like the situation of two people facing
each  other  in  a  narrow  hallway;  neither  person  has  any  stake  in  whether 
they  pass  on  the  left  or  the  right,  but  if  one  goes  left  and  the  other  right 
they  will  bump  into  each  other  and  make  no  progress.  When  A  and  B  are 
deterministic  processors  under  the  control  of  a  malicious  adversary  scheduler, 
we  can  show  the  scheduler  will  be  able  to  use  its  knowledge  of  their  state  and 
its control over the timing of events to keep A and B oscillating back and
forth  forever  between  the  two  possible  decision  values. 

Here is what happens. Since each processor is deterministic, at any given
point in time it has some preference, defined as the value ("left" or "right"
in the hallway example) that it will eventually choose if the other processor
executes no more operations [AH90a].¹ At the beginning of the protocol,
each  processor’s  preference  is  equal  to  its  input,  because  without  knowing 
that  some  other  processor  has  a  different  input  it  must  cautiously  decide  on 
its  own  input  to  avoid  violating  the  validity  condition.  So  we  can  assume 
that  initially  processor  A  prefers  to  pass  on  the  left,  and  processor  B  on  the 
right. 

Now  the  scheduler  goes  to  work.  It  stops  B  and  runs  A  by  itself.  After 
some  finite  number  of  steps,  A  must  make  a  decision  (to  go  left)  and  halt,  or 
the termination condition will be violated. But before A can finish, it must
make sure that B will make the same decision it makes, or the consistency
condition will be violated. So at some point A must tell B something that
will cause B to change its preference to "left", and in the shared-memory
model this message must take the form of a write operation (since B cannot
see when A does a read operation). Immediately before A carries out this
critical  write,  the  scheduler  stops  A  and  starts  B. 

This action puts B in the same situation that A was in. B still prefers to
go right, and after some finite number of steps it must tell A to change its
preference to "right". When this point is reached, one of two conditions
holds: either B has done something to neutralize A's still undelivered demand
that B change its preference, in which case the scheduler just stops B and
runs A again, or both A and B are about to deliver writes that will cause
the other to change its preference. In the latter case, the scheduler allows
both of the writes to go through, and now A prefers to go right and B prefers
to go left, putting the two processors back where they started with roles reversed.
In effect, the adversary uses its power over the timing of events to make sure
that just when A gives in and agrees to adopt B's position, B does exactly
the same thing, and so on ad infinitum.

¹ This unfortunate possibility is unlikely to occur in the real-world hallway situation,
assuming healthy participants, but it is allowed by the asynchronous shared-memory model
since the adversary can always choose never to run the other processor again.
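
The schedule described above can be illustrated with a toy model. This is only a sketch of the second case (both processors poised to deliver their critical writes), under the simplifying assumption that a processor's entire state is its current preference; the function name `adversary_rounds` is hypothetical, and the sketch is not a proof of the impossibility result.

```python
from typing import List, Tuple


def adversary_rounds(pref_a: str, pref_b: str,
                     rounds: int) -> List[Tuple[str, str]]:
    """Toy model: the adversary pauses each processor just before its critical
    write, then releases both writes together, so the preferences swap."""
    history = []
    for _ in range(rounds):
        # A runs alone until it is about to write "adopt my preference"; paused.
        pending_from_a = pref_a
        # B runs alone and reaches the same point; paused.
        pending_from_b = pref_b
        # Both writes are delivered: each processor adopts the other's demand.
        pref_a, pref_b = pending_from_b, pending_from_a
        history.append((pref_a, pref_b))
    return history
```

Starting from ("left", "right"), every round ends with the two preferences still different, which is the oscillation the argument relies on.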

Fortunately, human beings do not appear to be controlled by an adversary
scheduler, so in real life one hardly ever sees two people bouncing in
unison from one side of a hallway to the other for more than a few iterations.
Processors that do not have even the illusion of free will can nonetheless get
some of the same effect using randomization. Imagine in the hallway situation
that A had the ability to tell B to flip a fair coin to set its new preference.
No matter what A's preference was (or changed to), there would be a 50%
chance that the result of B's coin flip would match A's preference. In fact,
if both processors were continually flipping coins to change their preferred
value, a run of identical coin flips would soon occur that was long enough
that the two processors would be able to notice that they were in agreement,
and the protocol would terminate. Though this rough description leaves out
many important details, it gives the basic idea behind the first randomized
consensus protocol for the asynchronous shared-memory model, due to
Abrahamson [Abr88]. The only drawback of the approach is that it does not scale
well; the probability that n processors simultaneously flip heads is exponentially
small, and because agreement is not detected immediately in Abrahamson's
protocol, its worst-case expected running time is only bounded by $2^{O(n^2)}$.
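
To see why independent local coins scale badly, consider n processors that each flip one fair coin per round and stop when all flips match. The sketch below is an illustration only: it ignores the adversary entirely and is not Abrahamson's protocol, and the function names are hypothetical. Two of the $2^n$ equally likely outcomes are unanimous, so the expected number of rounds is $2^{n-1}$.

```python
import random


def agreement_probability(n: int) -> float:
    """Probability that n independent fair coins all show the same face."""
    # Exactly two of the 2**n outcomes (all heads, all tails) are unanimous.
    return 2 / 2 ** n


def expected_rounds(n: int) -> int:
    """Mean of the geometric distribution with the success probability above."""
    return 2 ** (n - 1)


def rounds_until_unanimous(n: int, rng: random.Random) -> int:
    """Simulate rounds of n local coin flips until one round is unanimous."""
    rounds = 0
    while True:
        rounds += 1
        flips = [rng.randrange(2) for _ in range(n)]
        if len(set(flips)) == 1:
            return rounds
```

With two processors the chance of agreement in a round is one half, as in the hallway example; with twenty processors the expected number of rounds already exceeds half a million.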

The  first  polynomial-time  consensus  protocol  for  this  model  is  described 
by  Aspnes  and  Herlihy  [AH90a].  The  key  observation,  similar  to  one  made 
by  Chor,  Merritt,  and  Shmoys  [CMS89]  in  the  context  of  a  different  model,  is 
that  the  n  different  local  coin  flips  can  be  replaced  by  a  single  shared  coin 
protocol  that  produces  random  bits  that  all  of  the  processors  agree  on  with 
at  least  a  constant  probability,  regardless  of  the  behavior  of  the  scheduler. 
We  showed  that  it  is  possible  to  construct  a  consensus  protocol  from  any 
shared  coin  protocol  by  running  the  shared  coin  repeatedly  until  agreement 
is  reached.  (A  description  of  the  construction  appears  in  Section  3.3.)  The 
cost  of  the  resulting  consensus  protocol  is  within  a  constant  factor  of  the  cost 
of  the  shared  coin.  Subsequent  work  on  shared-memory  consensus  protocols 
has  concentrated  primarily  on  the  problem  of  constructing  efficient  shared 
coin  protocols. 

All currently known shared coin protocols use some form of a very simple
idea. Each processor repeatedly adds random ±1 votes to a common pool
until some termination condition is reached. Any processor that sees a positive
total vote decides 1, and those that see a negative total vote decide 0.
Intuitively,  because  all  of  the  processors  are  executing  the  same  loop  over  and 
over  again,  the  adversary’s  power  is  effectively  limited  to  blocking  votes  it 
dislikes  by  stopping  processors  in  between  flipping  their  local  coins  to  decide 
on  the  value  of  the  votes  and  actually  writing  the  votes  out  to  the  registers. 
The  adversary’s  control  is  limited  by  running  the  protocol  for  long  enough 
that  the  sum  of  these  blocked  votes  is  likely  to  be  only  a  fraction  of  the  total 
vote, a process that requires accumulating $\Omega(n^2)$ votes.

In  the  original  shared  coin  protocol  of  Aspnes  and  Herlihy  [AH90a],  each 
processor  decides  on  a  value  when  it  sees  a  total  vote  whose  absolute  value 
is  at  least  a  constant  multiple  of  n  from  the  origin.  For  each  of  the  expected 
$O(n^2)$ votes, $\Theta(n^2)$ register operations are executed, giving a total running
time of $O(n^4)$ operations. Unfortunately, both the implementation of the
counter  representing  the  position  of  the  random  walk  and  the  mechanism  for 
repeatedly  running  the  shared  coin  require  a  potentially  unbounded  amount 
of  space.  This  problem  was  corrected  in  a  protocol  of  Attiya,  Dolev,  and 
Shavit  [ADS89],  which  retained  the  multiple  rounds  of  its  predecessor  but 
cleverly  reused  the  space  used  by  old  shared  coins  once  they  were  no  longer 
needed. 
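
The threshold rule can be sketched as a one-dimensional random walk. The code below is a simplification under stated assumptions: a single integer stands in for the shared counter, every vote is delivered immediately (no adversary withholds votes), and `shared_coin` and `threshold_factor` are hypothetical names. A symmetric ±1 walk started at 0 takes $a^2$ steps in expectation to reach distance $a$, so a barrier at a constant multiple of n gives the expected $O(n^2)$ votes mentioned above.

```python
import random
from typing import Tuple


def shared_coin(n: int, threshold_factor: int,
                rng: random.Random) -> Tuple[int, int]:
    """Add random +1/-1 votes until the total is threshold_factor * n from 0.

    Returns (outcome, votes_cast), where outcome is 1 for a positive total
    and 0 for a negative total.
    """
    barrier = threshold_factor * n
    total = 0
    votes = 0
    while abs(total) < barrier:
        total += rng.choice((-1, 1))  # one local coin flip, one vote
        votes += 1
    return (1 if total > 0 else 0), votes
```

In the real protocols the adversary can hold back one generated but unwritten vote per processor, which is why the barrier must be a multiple of n rather than a constant.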

A simpler descendant of the shared coin protocol of Aspnes and Herlihy,
which also requires only bounded space, is the shared coin protocol described
in Chapter 4. This protocol, by using a more sophisticated termination
condition, guarantees that the processors always agree on its outcome. A simple
modification of this protocol gives a consensus protocol that does not require
multiple executions of a shared coin; which can be implemented using only
three $O(\log n)$-bit counters, supporting increment, decrement, and read
operations; and which runs in only $O(n^2)$ expected counter operations. However,
this apparent speed is lost in the implementation of the counter, because
$O(n^2)$ register operations are needed for each counter operation, giving it
the same running time of $O(n^4)$ expected register operations as its
predecessors. Since the consensus protocol of Chapter 4 first appeared [Asp90],
other researchers [BR90, DHPW92] have described weaker primitives that
act sufficiently like counters to make the protocol work and which use only a
linear number of register operations for each counter operation. Using these
primitives in place of the counters gives a consensus protocol that runs in
expected $O(n^3)$ register operations.

An  alternative  to  having  each  processor  finish  the  protocol  when  it  sees  a 
total vote far from the origin is to simply gather votes until some
predetermined quorum is reached. The first shared coin protocol to use this technique
is that of Saks, Shavit, and Woll [SSW91]. It is still necessary to gather $\Omega(n^2)$
votes to overcome the effect of votes withheld by the scheduler, and in fact
the Saks-Shavit-Woll protocol still requires $O(n^4)$ register operations.
Furthermore, it is unlikely that any protocol that runs in a fixed number of
total  operations  can  guarantee  that  all  processors  agree  on  the  outcome  of 
the  coin;  thus  it  is  necessary  to  retain  the  complex  multiple  rounds  of  the 
Aspnes-Herlihy  protocol  in  some  form.  However,  stopping  the  protocol  after 
a  specified  number  of  votes  are  collected  has  a  very  important  consequence: 
it  is  no  longer  necessary  for  a  processor  to  check  for  termination  after  every 
vote  it  casts. 

This  remarkable  fact  was  observed  by  Bracha  and  Rachman  [BR91]  and 
is  the  basis  for  their  fast  shared  coin  protocol.  In  this  protocol,  as  in  previous 
shared  coin  protocols,  the  processors  repeatedly  generate  random  ±1  votes 
and add them to a running total. After a quorum of $O(n^2)$ votes are collected,
processors may decide on the output of the shared coin based on the sign
of the total vote. But each processor only checks if the quorum has been
reached after every $O(n / \log n)$ votes, so the processors can generate an
additional $O(n^2 / \log n)$ "extra" votes beyond the "common" votes making up
the  quorum.  However,  by  making  the  number  of  common  votes  large  enough 
compared  to  the  number  of  extra  votes,  the  probability  that  the  extra  votes 
will  change  the  sign  of  the  total  vote  can  be  made  arbitrarily  small.  Thus, 
even  if  one  processor  reads  the  total  vote  immediately  after  the  quorum  is 
reached  and  another  reads  it  after  many  extra  votes  have  been  cast,  it  is 
still  likely  that  both  will  agree  with  each  other  on  the  outcome  of  the  shared 
coin.  In  addition,  because  each  processor  only  needs  to  compute  the  total 
vote  once,  after  it  has  seen  a  full  quorum,  no  counters  or  other  complicated 
primitives  are  needed  to  keep  track  of  the  voting.  Each  processor  simply 
maintains in its own register a tally of all the votes it has cast, and computes
the total vote by summing the tallies in all of the registers. The result is that
the protocol requires only $O(n^2 \log n)$ expected total register operations.
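
The tally-per-register bookkeeping can be sketched as follows. This is an illustration under assumptions, not the Bracha-Rachman protocol itself: the `Tally` record (a vote count plus a vote sum), the function names, and the treatment of a zero total as outcome 0 are all hypothetical choices, and the schedule of quorum checks is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tally:
    count: int = 0  # number of votes the owner has cast (assumed field)
    total: int = 0  # sum of the owner's +1/-1 votes (assumed field)


def cast_vote(registers: List[Tally], owner: int, vote: int) -> None:
    """The owner overwrites its own register with an updated tally (one write)."""
    old = registers[owner]
    registers[owner] = Tally(old.count + 1, old.total + vote)


def collect(registers: List[Tally]) -> Tuple[int, int]:
    """Read every register once (n reads) and sum the tallies."""
    return (sum(r.count for r in registers),
            sum(r.total for r in registers))


def coin_value(registers: List[Tally], quorum: int) -> Optional[int]:
    """Return the coin's value once a quorum of votes is visible, else None."""
    count, total = collect(registers)
    if count < quorum:
        return None
    return 1 if total > 0 else 0  # a zero total maps to 0 (assumption)
```

One way to read the operation count: a collect costs n reads, and a processor performs one only every $O(n / \log n)$ votes, so the amortized cost is $O(\log n)$ reads per vote; multiplying by the $O(n^2)$ votes gives $O(n^2 \log n)$ register operations.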

There  is,  however,  still  room  for  improvement.  All  of  the  shared  coin 
protocols  we  have  described  suffer  from  a  fundamental  flaw:  if  the  scheduler 
stops all but one of the processors, that lone processor is still forced to
generate $\Omega(n^2)$ local coin flips. The essence of wait-freeness is bounding the
work  done  by  a  single  processor,  despite  the  failures  of  other  processors.  But 
the  bound  on  the  work  done  by  a  single  processor,  in  every  one  of  these 
protocols,  is  asymptotically  no  better  than  the  bound  on  the  work  done  by 
all  of  the  processors  together. 

Chapter  5  shows  that  wait-free  consensus  can  be  achieved  without  forcing 
a fast processor to do most of the work. I describe a shared coin protocol
in  which  the  processors  cast  votes  of  steadily  increasing  weights.  In  effect,  a 
fast  processor  or  a  processor  running  in  isolation  becomes  “impatient”  and 
starts  casting  large  votes  to  finish  the  protocol  more  quickly.  This  mechanism 
does  grant  the  adversary  greater  control,  because  it  can  choose  from  up  to  n 
different  weights  (one  for  each  processor)  when  determining  the  weight  of  the 
next  vote  to  be  cast.  One  effect  of  this  control  is  that  a  more  sophisticated 
analysis is required than for the unweighted-voting protocols. Still, with
appropriately chosen parameters the protocol guarantees that each processor
finishes after only $O(n \log^2 n)$ expected operations.

The organization of the dissertation is as follows. Chapters 2 and 3
provide a framework of definitions for the material in the later chapters. Chapter
2  describes  the  asynchronous  shared-memory  model  in  detail  and  compares 
it  with  other  models  of  distributed  systems.  Chapter  3  formally  defines  the 
consensus  problem  and  its  relationship  to  the  problem  of  constructing  shared 
coins. The main results appear in Chapters 4 and 5. Chapter 4 describes the
simple  consensus  protocol  based  on  a  random  walk.  Chapter  5  describes  the 
faster  protocol  based  on  weighted  voting.  Finally,  Chapter  6  compares  these 
results  to  other  solutions  to  the  problem  of  wait-free  consensus  and  discusses 
possible  directions  for  future  work. 

Much  of  the  content  of  Chapters  4  and  5  also  appears  in  [Asp90]  and 
[AW92],  respectively.  Some  of  the  material  in  Chapter  3  is  derived  from 
[AH90a].

Chapter 2

The  Asynchronous 
Shared-Memory  Model 


This chapter gives a detailed description of the asynchronous shared-memory
model.  This  model  is  the  standard  one  for  analyzing  wait-free  consensus 
protocols  [Abr88,  ADS89,  AH90a,  Asp90,  AW92,  BR90,  BR91,  DHPW92, 
SSW91].  Though  it  appears  in  varying  guises,  all  are  essentially  equivalent. 
The  description  of  the  model  here  largely  follows  that  of  the  “weak  model” 
of Abrahamson [Abr88]. The reader interested in a more formal definition of
the  model  may  find  one  in  [AH90a]  based  on  the  I/O  Automaton  model  of 
Lynch  [Lyn88]. 


2.1  Basic  elements 

The  system  consists  of  a  collection  of  n  processors,  state  machines  whose 
behavior  is  typically  specified  by  a  high-level  protocol.  In  principle  no 
limits  are  assumed  on  the  computational  power  of  the  processors,  although 
in  practice  none  of  the  protocols  described  in  this  document  will  require  much 
local  computation. 

The processors can communicate only by executing read and write
operations on a collection of single-writer, multi-reader atomic registers
[Lam77, Lam86b]. Each of these registers is associated with one of the
processors, its owner. Only the owner of a register is allowed to write to it,
although any of the processors may read from it. Atomicity means that
read and write operations act as if they take place instantaneously: they
never fail, and the result of concurrent execution of multiple operations on
the same register is consistent with their having occurred sequentially.
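
The ownership rule can be made concrete with a small sketch. The class below is an assumption for illustration only (the name `Register` and its methods do not come from the cited constructions), and it is meant for a single-threaded simulation in which operations are executed one at a time, so atomicity holds trivially; it says nothing about how to build atomic registers on real hardware.

```python
from typing import Any


class Register:
    """A single-writer, multi-reader register for a sequential simulation."""

    def __init__(self, owner: int, initial: Any = None) -> None:
        self.owner = owner      # id of the only processor allowed to write
        self._value = initial

    def read(self, reader: int) -> Any:
        """Any processor may read; the read never fails."""
        return self._value

    def write(self, writer: int, value: Any) -> None:
        """Only the owner may write; other writers violate the model."""
        if writer != self.owner:
            raise PermissionError(
                f"processor {writer} does not own this register")
        self._value = value
```

A system of n processors would hold one or more such registers per processor, with every processor able to read all of them.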

The assumptions behind atomicity may appear to be rather strong,
especially in a model that is designed to be as harsh as possible. However, it turns
out that atomic registers are not powerful enough to implement
deterministically such simple synchronization primitives as queues or test-and-set bits
[Her91],  and  may  be  constructed  efficiently  from  much  weaker  primitives  in  a 
variety  of  ways  [BP87,  IL87,  NW87,  Pet83,  SAG87].  So  in  fact  the  apparent 
strength  of  atomic  registers  is  somewhat  illusory. 


2.2  Time  and  asynchrony 

The systems represented by the model may have many events occurring
concurrently. However, because the only communication between processors in
the system is by operations on atomic registers, it is possible to represent
its behavior using a global-time model [BD88, Lam86a, Lam86b]. Instead
of  treating  operations  on  the  registers  as  occurring  over  possibly-overlapping 
intervals  of  time,  they  are  treated  as  occurring  instantaneously.  The  history 
of  an  execution  of  the  system  can  thus  be  described  simply  as  a  sequence 
of  operations.  Concurrency  in  the  system  as  a  whole  is  modeled  by  the 
interleaving  of  operations  from  different  processors  in  this  sequence. 

The actual order of the interleaving is the primary source of
nondeterminism in the system. At any given time there may be up to n processors that are
ready  to  execute  another  operation;  how,  then,  does  the  system  choose  which 
of  the  processors  will  run  next?  We  would  like  to  make  as  few  assumptions 
here  as  possible,  so  that  our  protocols  will  work  under  the  widest  possible 
set  of  circumstances.  One  way  of  doing  this  is  to  assign  control  over  timing 
to  an  adversary  scheduler,  a  function  that  chooses  a  processor  to  run  at 
each  step  based  on  the  previous  history  and  current  state  of  the  system.  The 
adversary  scheduler  is  not  bound  by  any  fairness  constraints;  it  may  start 
and  stop  processors  at  will,  doing  whatever  is  necessary  to  prevent  a  protocol 
from  executing  correctly.  In  addition,  no  limits  are  placed  on  the  scheduler’s 
computational  power  or  knowledge  of  the  programming  or  internal  states  of 
the processors. However, its control is limited to the timing of events
in the system; it cannot, for example, cause a read operation to return the
wrong value or a processor to deviate from its programming.

The  definition  of  a  wait-free  protocol  implicitly  depends  on  having  such 
a  powerful  adversary.  A  protocol  is  said  to  be  wait-free  if  every  processor 
finishes  the  protocol  in  a  finite  number  of  its  own  steps,  regardless  of  the 
relative  speeds  of  the  other  processors.  The  adversary  in  the  asynchronous 
shared-memory  model  simply  represents  the  universal  quantifier  hidden  in 
that  condition.  If  we  can  design  a  protocol  that  will  beat  an  all-powerful 
adversary,  we  will  know  that  the  protocol  will  succeed  in  the  far  easier  task 
of  working  correctly  in  whatever  circumstances  chance  and  the  workings  of 
a  real  system  might  throw  at  it. 
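
The global-time view and the adversary scheduler of this section can be sketched as a simulation loop. Everything named here is an assumption made for illustration: processors are modeled as Python generators that yield one operation description per step, the scheduler is a function of the history and the set of unfinished processors, and `starve_zero` is one example of an unfair scheduler, not the worst-case adversary for any particular protocol.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Op = Tuple[int, str]                       # (processor id, operation)
Scheduler = Callable[[List[Op], List[int]], int]


def run(processors: Dict[int, Iterator[str]], scheduler: Scheduler,
        max_steps: int) -> List[Op]:
    """Interleave operations one at a time, in the order the scheduler picks."""
    history: List[Op] = []
    ready = sorted(processors)
    while ready and len(history) < max_steps:
        pid = scheduler(history, ready)    # the scheduler sees the full history
        try:
            op = next(processors[pid])
        except StopIteration:
            ready.remove(pid)              # this processor has finished
            continue
        history.append((pid, op))
    return history


def starve_zero(history: List[Op], ready: List[int]) -> int:
    """Unfair scheduler: never runs processor 0 while anyone else is ready."""
    others = [p for p in ready if p != 0]
    return others[0] if others else ready[0]


def steps(name: str, k: int) -> Iterator[str]:
    """A stand-in processor that executes k labeled operations."""
    for i in range(k):
        yield f"{name}{i}"
```

Running two such processors under `starve_zero` produces a history in which processor 1 finishes before processor 0 takes a single step, which is exactly the kind of schedule a wait-free protocol must tolerate.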


2.3  Randomization 

In  order  to  solve  the  consensus  problem  in  the  presence  of  an  adversary 
scheduler,  the  processors  will  need  to  be  able  to  act  nondeterministically.  In 
addition  to  giving  each  processor  the  ability  to  write  to  its  own  registers  and 
to  read  from  any  of  the  registers,  we  will  give  each  processor  the  ability  to 
flip  a  local  coin.  This  operation  provides  the  processor  with  a  random  bit 
that  cannot  be  predicted  by  the  scheduler  in  advance,  though  it  is  known  to 
the  scheduler  immediately  afterwards  by  virtue  of  the  scheduler’s  ability  to 
see  the  internal  states  of  the  processors.  The  timing  of  coin-flip  operations, 
like  that  of  read  and  write  operations,  is  under  the  control  of  the  scheduler. 


2.4  Relation  to  other  models 

There  are  other  models  that  are  closely  related  to  the  asynchronous  shared- 
memory  model.  In  particular  it  is  tempting  to  define  the  property  of  being 
wait-free  as  the  property  that  a  protocol  will  finish  (that  is,  one  processor  will 
finish)  even  in  the  presence  of  up  to  n  -  1  halting  failures,  where  a  halting 
failure  is  an  event  after  which  a  processor  executes  no  more  operations.  Such 
a definition would make wait-freeness a natural extension of t-resilience,
the  property  of  working  in  the  presence  of  up  to  t  halting  failures. 

However,  in  the  context  of  a  totally  asynchronous  system  this  definition 
is  unnecessarily  restrictive.  It  is  true  that  the  adversary  is  able  to  simulate 
up to n - 1 halting failures simply by choosing not to run "halted"
processors ever again. However, there is no reason to believe that dead processors
are the only source of difficulty in an asynchronous environment. For
example, the adversary could choose to put some processor to sleep for a very
long  interval,  waking  it  only  when  its  view  of  the  world  was  so  outdated 
that  its  misguided  actions  would  only  hinder  the  completion  of  a  protocol. 
As  the  hallway  example  in  the  introduction  shows,  stopping  a  processor  and 
reawakening it much later can be even more devastating than stopping a processor
forever.  Furthermore,  distinguishing  between  slow  processors  and  dead  ones 
requires  either  an  assumption  that  slow  processors  must  take  a  step  after 
some bounded interval, or that fast processors may execute a potentially
unbounded number of operations waiting for the slow processors to revive. The
first  assumption  imposes  a  weak  form  of  synchrony  on  the  system,  violating 
the  principle  of  avoiding  helpful  assumptions;  the  second  makes  it  difficult 
to  measure  the  efficiency  of  a  protocol.  For  these  reasons  we  avoid  the  issue 
completely  by  using  the  more  general  definition. 

Other alternatives to the model involve changing the underlying
communications medium from atomic registers, either by adopting stronger
primitives that provide greater synchronization, or by moving to some sort of
message-passing model. We avoid the first approach because, as always, we
would  like  to  work  in  as  weak  a  model  as  possible.  However,  the  question  of 
how a different choice of primitives can affect the difficulty of solving
wait-free consensus is an interesting one about which little is known, except for
the  deterministic  case  [LAA87,  Her91]. 

Moving  to  a  message-passing  model  presents  new  difficulties.  In  general, 
the defining property of a message-passing model is that the processors
communicate by sending messages to each other directly, rather than operating
on  a  common  pool  of  registers  or  other  primitives.  Message-passing  models 
come  in  bewildering  variety;  a  general  taxonomy  can  be  found  in  [LL90]. 
Dolev et al. [DDS87] classify a large collection of message-passing models
and  show  which  are  capable  of  solving  consensus  deterministically. 

Among  these  many  models,  one  has  traditionally  been  associated  with 
solving  asynchronous  consensus  [BND89,  BT83,  CM89,  FLP85].  In  this 
model,  the  adversary  is  allowed  to  (i)  stop  up  to  t  processors  and  (ii)  de¬ 
lay  messages  arbitrarily.  Unfortunately,  a  simple  partition  argument  shows 
that in this model one cannot solve consensus even with a randomized
algorithm if at least n/2 processors can fail [BT83]. Intuitively, the adversary
can  divide  the  processors  into  two  groups  of  size  n/2  and  delay  all  messages 
passing between the groups. As neither group will be able to distinguish
this  partitioning  from  the  other  group  actually  being  dead,  the  two  groups 
will  independently  come  up  with  decisions  that  may  be  inconsistent.  On  the 
other  hand,  solutions  to  the  wait-free  consensus  problem  for  shared  memory 
can  be  used  to  obtain  solutions  to  consensus  problems  for  message-passing 
models  with  weaker  failure  conditions  by  simulating  the  shared  memory.  An 
example  of  this  technique  may  be  found  in  [BND89].  A  general  comparison 
of the power of shared-memory and message-passing models in the presence
of halting failures can be found in [CM89].


2.5  Performance  measures 

It  is  not  immediately  obvious  how  best  to  measure  the  performance  of  a  wait- 
free  decision  protocol.  Two  measures  are  very  natural  for  the  asynchronous 
model,  as  they  impose  no  implicit  assumptions  on  the  scheduling  of  opera¬ 
tions  in  the  system.  These  are  the  total  work  measure,  which  simply  counts 
the  total  number  of  register  operations  executed  by  all  the  processors  together 
until  every  processor  has  finished  the  protocol;  and  the  per-processor  work 
measure,  which  takes  the  maximum  over  all  processors  of  the  number  of  reg¬ 
ister  operations  executed  by  each  processor  individually  before  it  finishes  the 
protocol. 
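
As a small illustration of the two measures, the following Python sketch computes both from a record of an execution. The trace format (one processor id per register operation, in the order the operations occur) and the name work_measures are assumptions made for this example; they are not part of the model.

```python
from collections import Counter

def work_measures(trace):
    """Return (total work, per-processor work) for a finished execution.

    trace: iterable of processor ids, one entry per register operation,
    covering the execution until every processor has finished.
    """
    ops_per_processor = Counter(trace)
    total_work = sum(ops_per_processor.values())
    # Per-processor work is the maximum over processors of their own counts.
    per_processor_work = max(ops_per_processor.values(), default=0)
    return total_work, per_processor_work

# Processor 1 does four of the seven operations in this hypothetical trace.
print(work_measures([0, 1, 1, 2, 1, 0, 1]))
```

Note that the total work is always at most n times the per-processor work, so a bound on the latter yields a bound on the former, but not the other way around.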

The  per-processor  measure  is  closer  to  the  spirit  of  the  wait-free  guar¬ 
antee  that  each  processor  finishes  in  a  finite  number  of  its  own  steps,  as  it 
gives  an  upper  bound  on  what  that  finite  number  is.  However,  prior  to  the 
protocol  of  Chapter  5,  for  every  known  consensus  protocol  (see  Table  6.1) 
the  two  measures  were  within  a  constant  factor  of  each  other.  As  a  result 
only  the  total  work  measure  has  typically  been  considered.  This  usage  is  in 
contrast  to  what  is  needed  in  situations  where  processors  may  re-enter  the 
protocol  repeatedly,  as  in  protocols  for  simulating  various  shared  abstract 
data  types  [AAD+90,  AG91,  AH90b,  And90,  DHPW92,  Her91,  Plo89]  or  for 
timestamping  and  similar  mechanisms  [DS89,  DW92,  IL87].  In  these  pro¬ 
tocols  one  is  typically  interested  in  the  number  of  register  operations  each 
processor  must  execute  to  simulate  a  single  operation  on  the  shared  object, 
an  inherently  per-processor  measure,  and  the  total  work  is  meaningful  only 
when  interpreted  in  an  amortized  sense. 

Given the usefulness of the per-processor measure in this broader context,
I will concentrate primarily on it. However, because the total-work
measure has traditionally been used to analyze consensus protocols, it will be
considered as well.

An  alternative  to  these  measures  that  has  seen  some  use  in  analyzing 
wait-free protocols is the rounds measure of asynchronous time [AFL83,
ALS90, LF81, SSW91]. It is used for models that represent halting failures
explicitly. When using this measure, up to n - 1 processors may be designated
as  faulty  at  the  discretion  of  the  adversary;  once  a  processor  becomes  faulty  it 
is  never  allowed  to  execute  another  operation.  A  round  is  a  minimal  interval 
during  which  every  non-faulty  processor  executes  at  least  one  operation.  The 
measure  is  simply  the  number  of  these  rounds.  In  effect,  this  measure  counts 
the  operations  of  the  slowest  non-faulty  processor  at  any  given  point  in  the 
execution.  If  a  slow  processor  executes  only  one  operation  in  a  given  interval, 
only  one  round  has  elapsed,  even  though  a  faster  processor  might  have  carried 
out  hundreds  of  operations  during  the  same  interval. 
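
The following Python sketch makes the counting concrete. It is only one reading of the definition: rounds are carved out greedily from the start of the trace, and the trace format and the name count_rounds are assumptions made for this example.

```python
def count_rounds(trace, nonfaulty):
    """Count completed rounds in a trace of operations.

    trace: iterable of processor ids, one per operation, in execution order.
    nonfaulty: the processors not designated as faulty.
    A round ends as soon as every non-faulty processor has executed at
    least one operation since the previous round ended.
    """
    needed = set(nonfaulty)
    seen = set()
    rounds = 0
    for p in trace:
        if p in needed:
            seen.add(p)
            if seen == needed:
                rounds += 1
                seen = set()
    return rounds

# A fast processor (0) runs 100 operations while a slow one (1) runs one:
# by the greedy rule above this is a single round.
print(count_rounds([0] * 100 + [1], {0, 1}))
```

The example shows why the measure says little about the work done by fast processors: the hundred operations of processor 0 are invisible to it.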

The  rounds  measure  is  reasonable  if  one  defines  the  property  of  being 
wait-free as equivalent to being able to survive up to n - 1 halting failures.
However,  as  explained  above,  in  the  context  of  a  totally  asynchronous  sys¬ 
tem  this  definition  is  unnecessarily  restrictive.  But  once  we  adopt  the  more 
general  definitions  we  quickly  run  into  trouble.  If  some  processor  stops  and 
then  starts  again  much  later  during  the  execution  of  the  protocol,  the  entire 
period  that  the  processor  is  inactive  counts  as  only  one  round.  As  a  result 
the  rounds  measure  implicitly  resolves  the  problem  of  distinguishing  slow 
processors  from  dead  ones  by  guaranteeing  that  processors  will  either  run  at 
bounded  relative  speeds  or  not  run  at  all.  This  is  in  conflict  with  the  goal  of 
using  a  model  that  is  as  general  as  possible,  and  for  this  reason  the  rounds 
measure will not be used here.¹


¹ A notion of “rounds” does appear in Section 3.3; these rounds are part of the internal
structure  of  the  protocol  described  there  and  have  no  relation  to  the  rounds  measure. 




Chapter  3 

Consensus  and  Shared  Coins 


This  chapter  formally  describes  the  problem  of  solving  consensus  and  the 
closely related problem of constructing a shared coin, and gives an example
of  a  method  for  solving  consensus  using  a  shared  coin.  This  last  technique 
will  be  of  particular  importance  in  Chapter  5. 


3.1  Consensus 

Consensus  is  a  decision  problem  in  which  n  processors,  each  starting  with 
a  value  (0  or  1)  not  known  to  the  others,  must  collectively  agree  on  a  single 
value.  A  consensus  protocol  is  a  distributed  protocol  for  solving  consensus. 
It  is  correct  if  it  meets  the  following  conditions  [CIL87]: 

•  Consistency.  All  processors  decide  on  the  same  value. 

•  Termination.  Every  processor  decides  on  some  value  in  finite  expected 
time. 

•  Validity.  If  every  processor  starts  with  the  same  value,  every  processor 
decides  on  that  value. 
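
For a finished execution, the consistency and validity conditions can be checked mechanically. The Python sketch below does so; the function name check_outcome and the list-based encoding of inputs and decisions are assumptions made for this example. Termination is a statement about expected running time and cannot be checked from a single completed run, so it is left out.

```python
def check_outcome(inputs, decisions):
    """Check consistency and validity for one completed execution.

    inputs[i] and decisions[i] are the input bit and decision bit of
    processor i (each 0 or 1).
    """
    consistent = len(set(decisions)) <= 1
    if len(set(inputs)) == 1:
        # All inputs equal: every processor must decide that common value.
        valid = all(d == inputs[0] for d in decisions)
    else:
        # Mixed inputs: the validity condition imposes no constraint.
        valid = True
    return {"consistency": consistent, "validity": valid}

# A protocol that always decides 0 is consistent but not valid.
print(check_outcome([1, 1, 1], [0, 0, 0]))
```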

The  basic  idea  behind  consensus  is  to  allow  the  processors  to  make  a 
collective  decision.  For  this  purpose,  the  consistency  condition  is  the  most 
fundamental  of  the  correctness  conditions,  as  it  is  what  actually  guarantees 
that  the  processors  agree.  The  termination  condition  is  phrased  to  apply  in 
many possible models; in the asynchronous shared-memory model it
translates into requiring that the protocol be wait-free, as it requires that
processors must finish in finite expected time regardless of the actions of the
adversary  scheduler. 

If  it  happens  that  the  processors  already  agree  with  each  other,  we  want 
the  consensus  protocol  to  ratify  that  agreement  rather  than  veto  it;  hence  the 
validity  condition.  From  a  less  practical  perspective  the  validity  condition 
is  needed  because  its  absence  makes  the  problem  uninteresting,  since  all  of 
the  processors  could  just  decide  0  every  time  the  protocol  is  run  without  any 
communication  at  all. 

If  we  are  allowed  to  make  convenient  assumptions  about  the  system,  con¬ 
sensus  is  not  a  difficult  problem.  For  example,  on  a  PRAM  (perhaps  the 
friendliest  cousin  of  asynchronous  shared-memory)  consensus  reduces  to  sim¬ 
ply  taking  any  function  we  like  of  the  input  values  that  satisfies  the  validity 
condition.  In  general,  in  any  model  where  both  the  processors  and  the  com¬ 
munications  medium  are  reliable  the  problem  can  be  solved  simply  by  having 
the  processors  exchange  information  about  their  inputs  until  all  of  them  know 
the  entire  set  of  inputs;  at  this  point  each  can  individually  compute  a  func¬ 
tion  of  the  inputs  as  in  the  PRAM  case  to  come  up  with  the  decision  value 
for  the  protocol.  It  is  only  when  we  move  to  a  model,  like  asynchronous 
shared-memory, that allows processors to fail that consensus becomes hard.

One  difficulty  is  that  the  harsh  assumptions  of  the  asynchronous  shared- 
memory  model  can  amplify  the  correctness  conditions  in  ways  that  may  not 
be immediately obvious. For example, the validity condition implies that
the  adversary  can  always  force  the  processors  to  decide  on  a  particular  value 
by  running  only  those  processors  that  started  with  that  value.  Because  these 
“live” processors are unable to see the differing input values of the “dead”
processors,  they  will  see  a  situation  indistinguishable  from  one  in  which  ev¬ 
ery  processor  started  with  the  same  value.  In  this  latter  case,  the  validity 
condition  would  force  the  processors  to  decide  on  that  common  value.  So 
because  of  their  limited  knowledge,  the  live  processors  must  decide  on  the 
only  input  value  they  can  see,  even  though  there  may  be  other  processors 
that  disagree  with  it.  This  example  shows  that  one  must  be  very  careful 
about what assumptions one makes in the model, as they can subtly affect
what  a  protocol  is  allowed  to  do. 

3.2 Shared coins


In  order  to  solve  the  consensus  problem  we  will  need  to  cope  with  the  con¬ 
siderable  power  of  the  adversary.  We  cannot  modify  the  model  to  place 
restrictions  on  the  adversary;  instead,  we  must  find  some  way  of  getting  the 
processors  to  reach  agreement  in  spite  of  the  adversary’s  interference. 

One  way  is  to  base  a  consensus  protocol  on  a  stronger  primitive,  the 
shared  coin.  A  shared  coin  is  a  decision  protocol  in  which  each  processor 
decides on a bit, which with some probability δ will be the same value that
every  other  processor  decides  on.  But  unlike  consensus,  the  actual  value 
chosen  will  not  always  be  under  the  control  of  the  adversary.  In  order  to 
prevent  this  control,  given  the  adversary’s  ability  to  run  only  some  of  the 
processors,  we  must  drop  the  validity  condition  and  with  it  the  notion  of 
input  bits.  What  we  are  left  with  is  the  following  definition. 

A shared coin protocol with agreement parameter¹ δ is a distributed
decision  protocol  that  satisfies  these  two  conditions: 

•  Termination.  Every  processor  decides  on  some  value  in  finite  expected 
time. 

•  Probabilistic  agreement.  For  each  value  b  (0  or  1),  the  probability 
that every processor decides on b is at least δ.

The probabilistic agreement condition guarantees that with probability
at least 2δ the outcome of the shared coin protocol is agreed on by all processors
and is indistinguishable from flipping a fair coin. With probability 1 - 2δ,
no guarantees whatsoever are made; it is possible that the processors will
not agree with each other at all, or that the adversary will be able to choose
what value each processor decides on. Some sort of adversary control is
always possible, as it is known that a wait-free shared coin with δ exactly
equal to 1/2 is impossible [AH90a].

¹ Called the defiance parameter in [AH90a]. The less melodramatic term agreement
parameter is taken from [SSW91].

The agreement parameter is not the only possible parameter for a shared
coin, merely the one that is most convenient when building consensus
protocols. If we wish to use the coin directly (for example, as a source of
semi-random bits [SV66] in a distributed algorithm) a more natural parameter
is the bias, ε, defined by ε = 1/2 - δ. In terms of the bias the agreement
property can be restated as follows:

• Bounded bias. The probability that at least one processor decides on
a given value is at most 1/2 + ε.

This property says, in effect, that the adversary can force some processor to
see a particular outcome with only ε greater probability than if the processors
were actually collectively flipping a fair coin.
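
The step from probabilistic agreement to bounded bias is a one-line complement argument. Written out for a value b (0 or 1), and assuming that every processor eventually decides, so that "no processor decides b" is the same event as "every processor decides 1 - b":

```latex
\begin{align*}
\Pr[\text{some processor decides } b]
  &= 1 - \Pr[\text{every processor decides } 1 - b] \\
  &\le 1 - \delta \\
  &= \tfrac{1}{2} + \bigl(\tfrac{1}{2} - \delta\bigr)
   = \tfrac{1}{2} + \epsilon .
\end{align*}
```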

In  some  circumstances  we  would  like  to  guarantee  that  all  of  the  pro¬ 
cessors  always  agree  on  the  outcome  of  the  coin,  even  though  the  adversary 
might  have  been  able  to  control  what  that  outcome  is.  A  shared  coin  that 
guarantees agreement will be called robust. As will be seen in Chapter 4,
robust shared coins can often be converted directly into consensus protocols
by the addition of only a small amount of machinery. However, Chapter 5
describes an intrinsically non-robust shared coin; in this situation more
sophisticated techniques are needed to achieve consensus. One approach is
described  in  the  next  section. 


3.3  Consensus  using  shared  coins 

It  is  a  well-established  result  that  one  can  construct  a  consensus  protocol  from 
a  shared  coin  with  constant  agreement  parameter  [ADS89,  AH90a,  SSW91]. 
This section gives as an example the first of these constructions [AH90a]. As
we shall see, this construction gives a consensus protocol which requires an
expected O((T(n) + n)/δ) operations per processor and O((T'(n) + n^2)/δ)
total operations, where T(n) and T'(n) are the expected number of operations
per processor and total operations for the shared coin protocol.

Pseudocode for each processor’s behavior in the shared-coin-based
consensus protocol is given in Figure 3.1. Each processor has a register of its own
with two fields: prefer and round, initialized to (⊥, 0). In addition there is
assumed to be a (potentially unbounded) collection of shared coin primitives,
one for each “round” of the protocol. Two special terms are used to simplify
the description of the protocol. A processor is a leader if its round field is
greater than or equal to every other processor’s round field. Two processors
agree if both their prefer fields are equal, and neither field is ⊥.
 1  procedure consensus(input)
 2      (prefer, round) ← (input, 1)
 3      repeat
 4          read all the registers
 5          if all who disagree trail by 2 and I’m a leader then
 6              output prefer
 7          else if leaders agree then
 8              (prefer, round) ← (leader preference, round + 1)
 9          else if prefer ≠ ⊥ then
10              (prefer, round) ← (⊥, round)
11          else
12              (prefer, round) ← (shared_coin[round], round + 1)

Figure 3.1: Consensus from a shared coin.

Let  us  sketch  out  the  workings  of  the  protocol.  The  most  serious  problem 
that  the  protocol  is  designed  to  solve  is  how  to  neutralize  “slow”  processors 
that  have  old,  out-of-date  views  of  the  world.  Because  such  processors  end 
up  with  low  round  values  relative  to  the  “fast”  processors,  they  are  effectively 
excluded  from  the  real  decision-making  in  the  protocol  until  they  manage  to 
catch  up  to  their  faster  comrades. 

Intuitively,  the  decision-making  process  consists  of  the  leaders  running 
the  shared  coin  protocol  in  line  12.  It  is  not  necessarily  the  case  that  all  of 
the  leaders  at  each  round  will  take  part  in  the  shared  coin  protocol,  as  those 
that  arrive  earliest  may  not  see  disagreement  and  will  execute  line  8  instead. 
However,  those  early  arrivals  must  in  fact  agree  with  each  other,  and  so  with 
probability at least δ the others will switch to agree with them at any given
round. It follows that the expected number of rounds until agreement is
O(1/δ).

Once the leaders agree, the slower processors are forced to adopt the
leaders’  position  by  executing  line  8.  The  protocol  terminates  when  the 
agreeing  processors  advance  far  enough  (2  rounds)  to  know  that  any  processor 
that  disagrees  will  pass  through  line  8  before  catching  up  and  becoming  a 
leader  itself. 
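
The tests used in lines 5 and 7 of Figure 3.1 can be sketched as predicates over a snapshot of the registers. The Python below is an interpretation rather than the construction of [AH90a]: the names BOTTOM, is_leader, agree, leaders_agree and may_decide are hypothetical, and "all who disagree trail by 2" is read here as "every register that does not agree with mine has a round field at least 2 smaller than mine".

```python
BOTTOM = None  # stands for the "no preference" value written ⊥ in the text

def is_leader(i, regs):
    """regs[j] = (prefer, round). Processor i leads if no round exceeds its own."""
    return all(regs[i][1] >= rnd for _, rnd in regs)

def agree(p, q):
    """Two registers agree if their prefer fields are equal and neither is ⊥."""
    return p[0] is not BOTTOM and p[0] == q[0]

def leaders_agree(regs):
    """Line 7 test: all current leaders agree with one another."""
    leaders = [reg for j, reg in enumerate(regs) if is_leader(j, regs)]
    return all(agree(leaders[0], reg) for reg in leaders)

def may_decide(i, regs):
    """Line 5 test, under the reading stated in the lead-in (an assumption)."""
    mine = regs[i]
    return (mine[0] is not BOTTOM
            and is_leader(i, regs)
            and all(agree(mine, reg) or reg[1] <= mine[1] - 2 for reg in regs))

# Two leaders prefer 1 at round 5; the dissenters are at rounds 3 and 2.
snapshot = [(1, 5), (1, 5), (0, 3), (BOTTOM, 2)]
print(may_decide(0, snapshot), leaders_agree(snapshot))
```

The snapshot passed to these predicates corresponds to the result of line 4; since the registers are read one at a time, it need not describe any single instant of the execution, which is one of the details the informal sketch above glosses over.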

This explanation is informal, and glosses over many important but tedious
details of the protocol. The interested reader is referred to [AH90a] for a more
thorough  description  of  the  construction  including  a  full  proof  of  correctness. 
Alternative  constructions  with  similar  performance  may  be  found  in  [ADS89] 
and  [SSW91]. 

For our purposes it will suffice to summarize the relevant results from
[AH90a]: 

Theorem 3.1 ([AH90a]) The protocol of Figure 3.1 implements a consensus
protocol that requires an expected O(1/δ) rounds, where δ is the agreement
parameter of the shared coin.

From  which  it  follows  that: 

Corollary 3.2 The protocol of Figure 3.1 implements a consensus protocol
that requires an expected O((T(n) + n)/δ) operations per processor and
O((T'(n) + n^2)/δ) operations in total, where δ is the agreement parameter,
T(n) the expected number of operations per processor, and T'(n) the expected
number of operations in total for the shared coin.

Proof: From the theorem, we expect at most O(1/δ) rounds.

In each round, each processor executes at most 2n read operations, one
instance of the shared coin, and two write operations, for a total of 2n + 2 +
T(n) operations.

Similarly, in each round the processors collectively execute at most 2n^2
read operations, 2n write operations, and one instance of the shared coin, for
a total of 2n^2 + 2n + T'(n) operations. ∎
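
The last step of the argument, left implicit in the proof, multiplies the per-round costs by the expected number of rounds. Informally, and treating the per-round costs as bounds that apply to every round:

```latex
\begin{align*}
\text{per processor:} \quad
  & O(1/\delta) \cdot \bigl(2n + 2 + T(n)\bigr)
    = O\bigl((T(n) + n)/\delta\bigr), \\
\text{in total:} \quad
  & O(1/\delta) \cdot \bigl(2n^2 + 2n + T'(n)\bigr)
    = O\bigl((T'(n) + n^2)/\delta\bigr).
\end{align*}
```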


Chapter 4


Consensus  Using  a  Random 
Walk 


All currently known wait-free consensus protocols that run in polynomial
time are based on some form of a shared coin protocol. The key insight
used  in  constructing  shared  coins  is  that  it  is  dangerous  to  give  too  much 
power  over  the  outcome  of  the  protocol  to  any  one  processor  at  any  given 
time.  Such  tyranny,  like  all  tyrannies,  runs  the  risk  of  a  sudden  change 
in  policy  following  an  assassination,  and  thereby  gives  control  over  policy  to 
potential  assassins  like  our  adversary  scheduler.  In  all  currently  known  shared 
coin  protocols  each  individual  processor’s  power  is  minimized  by  having  the 
processors  repeatedly  cast  small  random  votes  for  the  two  decision  values. 

How  this  voting  process  is  best  represented  depends  on  the  method  used 
to  decide  when  it  is  finished.  In  this  chapter  we  describe  a  shared  coin  and 
a  consensus  protocol  in  which  the  voting  ends  when  the  difference  between 
the  number  of  1  votes  and  0  votes  is  large.  Under  such  circumstances  the 
voting  process  can  be  viewed  as  a  random  walk  in  which  each  vote  moves 
the total up or down by one until an absorbing barrier at ±K is reached
(where K is a parameter of the protocol). In fact, the original polynomial-time
shared coin of [AH90a] worked on exactly this principle. Unfortunately,
a  simple  implementation  of  a  random  walk  does  not  guarantee  agreement, 
as  the  adversary  can  allow  one  processor  to  see  a  total  greater  than  K  and 
decide  1,  and  then,  by  releasing  negative  votes  “trapped”  inside  stopped 
processors,  move  the  total  down  out  of  the  decision  range  so  that  with  some 
nonzero probability the other processors will eventually move it to -K and
decide 0. So to use a simple random-walk-based shared coin in a consensus
protocol  one  would  need  to  run  it  repeatedly  as  described  in  Section  3.3. 

The  protocols  described  in  this  chapter  avoid  the  need  for  such  methods 
by extending the random walk to incorporate the function of detecting
agreement. As a result we obtain a robust shared coin, described in Section 4.2,
which guarantees that all processors agree on its outcome. Because the coin
guarantees agreement, it can be modified to obtain a consensus protocol
simply by attaching a preamble to ensure validity, as described in Section 4.4.
The resulting consensus protocol (and its variants, obtained by replacing the
counter implementation [BR90, DHPW92]) are particularly simple, as they
are the only known wait-free consensus protocols that do not require the
repeated execution of a non-robust shared coin protocol and the multi-round
superstructure that comes with it.

The  simplicity  of  the  protocol  also  allows  some  optimizations  that  are 
more difficult when using a non-robust coin. The consensus protocol is
designed to require fewer total operations if fewer processors actually
participate in it, a feature which becomes important when, for example, the protocol
is used as a primitive for building shared data structures which only a few
processors  might  attempt  to  access  simultaneously. 

The  chapter  is  organized  as  follows.  Section  4.1  describes  some  properties 
of random walks that will be used later in the chapter. Section 4.2 describes
the  robust  shared  coin  protocol  and  proves  its  correctness.  The  description  of 
the  robust  shared  coin  protocol  assumes  the  presence  of  an  atomic  counter, 
providing  increment,  decrement,  and  read  operations  that  appear  to  occur 
sequentially; Section 4.3 shows how such a counter may be built from
single-writer atomic registers at the cost of O(n^2) register operations per counter
operation. Finally, Section 4.4 describes the consensus protocol obtained by
modifying  the  robust  shared  coin. 


4.1  Random  walks 

Let  us  begin  by  stating  a  few  basic  lemmas  about  the  behavior  of  random 
walks. 

Lemma 4.1 Consider a symmetric random walk with step size 1 running
between absorbing barriers at a and b and starting at x, where a < x < b.
Then:

1. The expected number of steps until one of the barriers is reached is
given by $(x - a)(b - x)$, which is always less than or equal to $\left(\frac{b - a}{2}\right)^2$.

2. The probability that the random walk hits b before a is $\frac{x - a}{b - a}$.


Proof:  The  random  walk  described  is  just  a  form  of  the  classical  gambler’s 
ruin problem. See [Fel68, pp. 344-349]. ∎
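
As a sanity check on the two closed forms of the gambler's ruin problem, the Python sketch below verifies with exact arithmetic that they satisfy the one-step recurrences and boundary conditions of a symmetric walk between absorbing barriers. The function names are chosen for this example only.

```python
from fractions import Fraction

def expected_steps(a, b, x):
    """Expected steps to absorption at a or b, starting from x."""
    return (x - a) * (b - x)

def prob_hit_b_first(a, b, x):
    """Probability of reaching b before a, starting from x."""
    return Fraction(x - a, b - a)

def check_lemma_4_1(a, b):
    # Boundary conditions at the absorbing barriers.
    assert expected_steps(a, b, a) == 0 and expected_steps(a, b, b) == 0
    assert prob_hit_b_first(a, b, a) == 0 and prob_hit_b_first(a, b, b) == 1
    for x in range(a + 1, b):
        # One step is taken, then the walk continues from x - 1 or x + 1.
        neighbours = expected_steps(a, b, x - 1) + expected_steps(a, b, x + 1)
        assert expected_steps(a, b, x) == 1 + Fraction(neighbours, 2)
        assert prob_hit_b_first(a, b, x) == (
            prob_hit_b_first(a, b, x - 1) + prob_hit_b_first(a, b, x + 1)) / 2
        # The expected time never exceeds ((b - a) / 2)^2.
        assert 4 * expected_steps(a, b, x) <= (b - a) ** 2
    return True

print(check_lemma_4_1(-5, 7))
```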


Lemma 4.2 Consider a symmetric random walk with step size 1 running
between a reflecting barrier at a and an absorbing barrier at b, starting at
position x, a < x < b. Then the expected time until b is reached is
$(x - (a - (b - a)))(b - x) \le (b - a)^2$.

Proof: This random walk can be obtained from the random walk with
absorbing barriers at b and $a - (b - a)$ by the transformation $x \mapsto a + |x - a|$. ∎
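
The closed form for the reflecting case can be checked in the same way as a recurrence. The sketch below assumes, following the folding transformation in the proof, that a step taken from the reflecting barrier a always lands on a + 1; the function names are hypothetical.

```python
def expected_steps_reflecting(a, b, x):
    """Expected time to reach b with a reflecting barrier at a (closed form)."""
    return (x - (a - (b - a))) * (b - x)

def check_lemma_4_2(a, b):
    f = expected_steps_reflecting
    assert f(a, b, b) == 0                    # already absorbed at b
    assert f(a, b, a) == 1 + f(a, b, a + 1)   # reflected step from a
    assert f(a, b, a) == (b - a) ** 2         # worst case is a start at a
    for x in range(a + 1, b):
        # Interior points: one step, then average over both neighbours.
        assert 2 * f(a, b, x) == 2 + f(a, b, x - 1) + f(a, b, x + 1)
        assert f(a, b, x) <= (b - a) ** 2
    return True

print(check_lemma_4_2(3, 11))
```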

The  following  critical  lemma  describes  a  modified  random  walk  that  will 
be  of  great  importance  in  analyzing  the  shared  coin  and  consensus  protocols: 

Lemma 4.3 Consider a symmetric random walk with absorbing barriers at
a and b with the following twist: a point c, a < c < b, is chosen as the center
of the random walk. The adversary chooses the starting position x of the
random walk to be anywhere in the range from a to b. Also, before each step,
the adversary may choose between moving randomly in either direction with
probability 1/2, or moving away from c with probability 1. No matter what
choices the adversary makes, the expected number of steps until one of the
barriers is reached is at most $(b - a)^2$.

Proof:  The  game  described  can  be  thought  of  as  a  controlled  Markov 
process  [DY75]  in  which  the  adversary  is  trying  to  maximize  the  expected 
time.  Because  this  process  is  played  over  a  finite  set  of  states,  a  standard 
result  of  the  theory  of  controlled  Markov  processes  can  be  applied.  This 
result  states  that  the  maximum  time  can  be  achieved  by  an  adversary  using 
a  simple  strategy,  one  which  chooses  the  same  option  from  each  state  at  all 
times. 

Such  a  strategy  can  be  specified  by  listing  the  points  where  the  adversary 
chooses  to  force  the  particle  to  move  away  from  c.  We  can  think  of  these 
points as dividing the range of the random walk into intervals; between each
pair  of  points  where  the  adversary  forces  the  particle  to  move  determinis¬ 
tically  is  a  region  where  the  particle  moves  randomly.  The  points  at  the 
edge  of  these  random  regions  act  like  barriers  in  a  random  walk.  A  point  on 
the  side  away  from  c  pushes  the  particle  into  a  new  region  and  so  acts  like 
an  absorbing  barrier,  while  a  point  on  the  side  toward  c  pushes  the  particle 
back  into  the  old  region  and  so  acts  like  a  reflecting  barrier.  Thus  the  region 
containing  c  acts  like  a  random  walk  with  two  absorbing  barriers,  and  the 
remaining  regions  act  like  random  walks  with  one  absorbing  barrier  (on  the 
side  away  from  c)  and  one  reflecting  barrier  (on  the  side  toward  c). 

Because  each  barrier  can  only  be  crossed  away  from  c,  once  the  particle 
leaves  a  region  it  can  never  return.  Now,  suppose  the  particle  starts  in  a 
region with width $w_1$. After at most $w_1^2$ steps on average (by Lemmas 4.1 or
4.2) it will pass into a new region of width $w_2$; after an additional $w_2^2$ steps
it will pass into a new region of width $w_3$, and so on until either a or b is
reached. Since these regions all fit between a and b, $\sum w_i \le b - a$, and thus
(since each $w_i \ge 0$) $\sum w_i^2 \le (b - a)^2$. ∎

Though  the  bound  in  Lemma  4.3  is  proved  for  the  case  of  a  very  powerful 
adversary  that  is  always  allowed  to  choose  between  a  random  move  and  a 
deterministic  move  at  each  step,  the  bound  applies  equally  well  to  a  weaker 
adversary  whose  choices  are  more  constrained,  as  the  stronger  adversary 
could  always  choose  to  operate  within  the  weaker  adversary’s  constraints. 
This  technique,  of  proving  bounds  for  a  strong  adversary  that  carry  over  to 
a  weaker  one,  has  great  simplifying  power.  It  will  be  used  extensively  in  the 
analysis  of  the  shared  coin  and  consensus  protocols. 
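
The bound of Lemma 4.3 can also be explored numerically. The Python sketch below runs value iteration on the controlled walk, letting the adversary pick at each point whichever option (random step or forced step away from c) gives the larger expected remaining time. Starting from zero, the iterates approach the adversary's best value from below. The function name, the number of sweeps, and the treatment of the point c itself (where either direction is allowed to count as "away") are assumptions made for this example.

```python
def adversarial_walk_values(a, b, c, sweeps=5000):
    """Approximate the largest expected absorption time for each start in [a, b]."""
    v = {x: 0.0 for x in range(a, b + 1)}
    for _ in range(sweeps):
        new = {a: 0.0, b: 0.0}  # the barriers are absorbing
        for x in range(a + 1, b):
            random_move = (v[x - 1] + v[x + 1]) / 2
            if x > c:
                forced_move = v[x + 1]
            elif x < c:
                forced_move = v[x - 1]
            else:
                # Assumption: at c itself the adversary may push either way.
                forced_move = max(v[x - 1], v[x + 1])
            new[x] = 1 + max(random_move, forced_move)
        v = new
    return v

values = adversarial_walk_values(0, 10, 1)
print(max(values.values()) <= (10 - 0) ** 2)
```

Because the adversary may always choose the random step, the computed values are never below the unconstrained expected time (x - a)(b - x) of Lemma 4.1, which is the same strong-versus-weak comparison made above, seen from the other side.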


4.2  The  robust  shared  coin  protocol 

Figure  4.1  shows  pseudocode  for  each  processor’s  behavior  in  the  robust 
shared  coin  protocol.  The  coin  is  constructed  using  an  atomic  counter, 
which  supports  atomic  increment,  decrement,  and  read  operations.  In  this 
section,  these  operations  are  assumed  to  take  unit  time.  The  counter  is 
initialized  to  0.  The  processor’s  local  coin  is  represented  by  the  procedure 
local_flip, which returns the values -1 and 1 with equal probability.

Shared data:
    counter: counter with range [-K - 3n, K + 3n], initialized to 0

 1  procedure shared_coin()
 2      repeat
 3          c ← counter
 4          if c < -(K + n) then output 0
 5          else if c > (K + n) then output 1
 6          else if c < -K then decrement counter
 7          else if c > K then increment counter
 8          else
 9              if local_flip() = 1 then increment counter
10              else decrement counter

Figure 4.1: Robust shared coin protocol.

[Figure 4.2 image not reproduced; the surviving axis labels are -K - n and +K + n.]

Figure 4.2: Pictorial representation of robust shared coin protocol.

A processor’s behavior in the protocol is represented in pictorial form in
Figure 4.2. While a processor reads values in the central range from -K
to K (where K is a parameter of the protocol) it flips a local fair coin to
decide whether to increment or decrement the counter. This part of the
protocol is essentially the same as the random-walk-based shared coin of
Aspnes and Herlihy [AH90a]. What is new is the addition of a “slope” at
either side of the random walk. On these slopes, a processor does not move
the counter randomly but instead always moves it away from the center.
When a processor reads a counter value in one of the “buckets” beyond the
slopes, it decides either 0 or 1 depending on the sign of the counter.

If  the  slopes  are  wide  enough,  once  any  processor  has  seen  a  value  that 
causes  it  to  decide,  all  other  processors  will  see  values  that  cause  them  to 
push  the  counter  toward  the  same  decision.  This  mechanism  eliminates  the 
possibility  that  delayed  writes  might  move  the  counter  out  of  the  decision 
range  and  allow  the  random  walk  (with  small  but  non-negligible  probability) 
to  wander  over  to  the  other  side.  More  formally,  we  can  show: 

Lemma 4.4 If any processor reads a counter value v > (K + n), then all
subsequent reads will return values greater than or equal to K + 1; in the
symmetric case where v < -(K + n), all subsequent values read will be less
than or equal to -(K + 1).

Proof: Suppose that a processor has read v > (K + n); then it immediately
terminates, leaving n - 1 running processors. Thus the number d of processors
that will execute a decrement before their next read is at most n - 1. Let
l = c - d, where c is the value stored in the counter. Since c > (K + n), it
must be the case that l > K + 1. Now consider the effect of the actions the
scheduler can take. If it allows a decrement to proceed, c and d both drop
by 1 and l remains constant. If it allows an increment to occur, c increases
and l increases with it. If it allows a read, the value read is c ≥ l ≥ K + 1,
and thus d is unaffected. In each case l remains at least K + 1, and the claim
follows since c ≥ l. The proof of the symmetric case is similar. ∎

The consistency property follows immediately from Lemma 4.4. A similar
argument  shows  that  the  counter  will  not  overflow: 

Lemma 4.5 The counter value never leaves the range [−K − 3n, K + 3n] in
any execution of the shared coin protocol.

Proof: Suppose that the counter reaches K + 2n at some point. Then each
processor will execute at most one increment or decrement operation before
it reads the counter, at which point it will decide 1 and execute no additional
operations. Thus the counter cannot exceed K + 2n + n = K + 3n. The full
result follows by symmetry. ∎

Proving the termination and bounded bias properties of the shared coin
requires some additional machinery. Define the true position t of the random
walk to be the value in the counter, plus 1 for each processor that will
increment the counter before its next read, and minus 1 for each processor
that will decrement the counter before its next read. The following lemma
relates the value read by a processor to the true position of the random walk:

Lemma 4.6 Let c be a value read from the counter by some processor and
t the true position of the random walk in the state preceding the read. Then
|c − t| ≤ n − 1.

Proof: There can be at most n − 1 processors with pending increments or
decrements. ∎

Let  us  assume  hereafter  that  the  scheduler  can  cause  a  processor  to  read 
any value between t − (n − 1) and t + (n − 1). Because such a scheduler could
always  choose  to  simulate  any  scheduler  the  protocol  will  actually  face,  any 
“good”  statement  we  can  prove  with  the  assumption  will  carry  over  to  the 
situation  without  it.  The  advantage  of  granting  the  adversary  this  additional 
power  is  that  it  allows  us  to  forget  about  the  vagaries  of  the  counter  value. 
Instead  we  can  treat  the  protocol  as  a  controlled  random  walk  using  the  true 
position  t. 

Consider the lower part of Figure 4.3 (the upper part simply repeats
Figure 4.2 without the buckets). If the true position t is in the central region
between −K + (n − 1) and K − (n − 1), then Lemma 4.6 implies that any
processor that reads the counter will see a value between −K and K and
move t randomly. In the two immediately adjacent regions, any processor
will either read a value between −K and K, and move t randomly, or read a
value that causes it to move t away from 0. Finally, any processor that reads
a value in the outermost regions where |t| > K + (n − 1) will either make a
decision or move t away from 0. In each of these cases, the scheduler is never
allowed to force the true position toward 0; and if K is large relative to n,
much of the execution of the protocol will be spent in the central region where
the scheduler's control is ineffective. These two properties of the protocol are
the basis of the proof of its termination and bounded bias, as shown below.

Figure 4.3: The protocol as a controlled random walk.
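
The controlled random walk can be explored with a small simulation. The sketch below is illustrative only: it assumes the decision rule suggested by Lemma 4.4 (decide once a read returns a value of magnitude at least K + n, push outward when the value read lies beyond ±K, flip a local coin otherwise), and it uses a uniformly random scheduler rather than an adversary. The names robust_coin, pending and decision are hypothetical.

```python
import random

def robust_coin(n, K, seed=0):
    """Simulate n processors sharing one counter; returns (decisions, min, max)."""
    rng = random.Random(seed)
    counter = 0
    pending = [None] * n      # operation chosen at the last read, not yet applied
    decision = [None] * n
    lo = hi = 0               # extreme counter values seen during the run
    while any(d is None for d in decision):
        # Random scheduler: pick any processor that has not decided yet.
        p = rng.choice([i for i in range(n) if decision[i] is None])
        if pending[p] is not None:
            counter += pending[p]          # delayed increment or decrement lands
            pending[p] = None
            lo, hi = min(lo, counter), max(hi, counter)
            continue
        v = counter                        # atomic read
        if v >= K + n:
            decision[p] = 1
        elif v <= -(K + n):
            decision[p] = 0
        elif v > K:
            pending[p] = +1                # on the upper slope: move away from 0
        elif v < -K:
            pending[p] = -1                # on the lower slope: move away from 0
        else:
            pending[p] = rng.choice((+1, -1))   # central region: local coin flip
    return decision, lo, hi
```

Under these assumptions every run should end with all processors holding the same decision and with the counter inside [−(K + 3n), K + 3n], in line with Lemmas 4.4 and 4.5.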

Lemma 4.7 The robust shared coin protocol executes an expected
O((K + n)²) total counter operations when K > n.

Proof: If we consider the true position t, Lemma 4.6 implies that the
scheduler can only force t up if t > K − (n − 1) > 1 and down if
t < −K + (n − 1) < −1. Furthermore, if |t| ever exceeds K + n + (n − 1),
each processor will decide after its next read. Thus the movement of the true
position is a controlled random walk in the sense of Lemma 4.3 with center 0
and barriers at ±(K + 2n − 1). The expected number of steps until a barrier is
reached is at most 4(K + 2n − 1)², which will be followed by at most
2n operations as the processors each decide. Since each step takes a constant
number of counter operations, the expected number of operations required is
O((K + n)²). ∎

The  time  bound  of  Lemma  4.7  shows  that  every  processor  terminates  in 
finite expected time when K > n. The bounded bias property is a consequence
of the following lemma:

Lemma 4.8 Against any scheduler, the probability that the processors in the
robust shared coin protocol will decide 1 is between (K − (n − 1))/(2K) and
(K + (n − 1))/(2K).

Proof: Suppose the scheduler is trying to maximize the probability of deciding
1. Under the simplifying assumption it can force a decision of 1 as soon
as t = K − (n − 1); however, if it allows t to slip below −K − (n − 1) the
processors will eventually decide 0. When −K − (n − 1) < t < K − (n − 1)
the scheduler may choose between moving t randomly or forcing t toward
−K − (n − 1). Clearly, forcing the counter toward −K − (n − 1) can only
increase the probability of deciding 0, so choosing to move t randomly
maximizes the probability of deciding 1. But if the scheduler makes this choice,
the movement of the true position becomes a simple random walk with
absorbing barriers at −K − (n − 1) and K − (n − 1). By Lemma 4.1, the
probability that t reaches K − (n − 1) first is (K + (n − 1))/(2K). The lower
bound follows by symmetry. ∎
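
The endpoints in Lemma 4.8 can be checked with exact arithmetic. The sketch below assumes the standard hitting probability for a fair ±1 random walk started between absorbing barriers at −a and b, namely (start + a)/(a + b) for reaching b first (the role played here by Lemma 4.1). The function names are hypothetical.

```python
from fractions import Fraction

def hit_upper_first(a, b, start=0):
    """Probability that a fair +/-1 walk from `start` reaches b before -a."""
    return Fraction(start + a, a + b)

def decide_one_bounds(K, n):
    """Lower and upper bounds on Pr[decide 1] as in the proof of Lemma 4.8."""
    upper = hit_upper_first(K + (n - 1), K - (n - 1))
    return 1 - upper, upper

def is_harmonic(a, b):
    """Check p(x) = (p(x-1) + p(x+1)) / 2 inside the barriers, p(-a)=0, p(b)=1."""
    p = lambda x: hit_upper_first(a, b, x)
    interior = all(p(x) == (p(x - 1) + p(x + 1)) / 2 for x in range(-a + 1, b))
    return interior and p(-a) == 0 and p(b) == 1
```

For example, decide_one_bounds(10, 3) evaluates the bounds for K = 10 and n = 3, where the barriers sit at −12 and 8; is_harmonic confirms that the closed form satisfies the one-step averaging condition that characterizes the hitting probability.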

Combining the lemmas we obtain:

Theorem 4.9 When K > n, the protocol of Figure 4.1 implements a robust
shared coin.

Proof: Consistency follows from Lemma 4.4, termination from Lemma 4.7,
and bounded bias from Lemma 4.8. ∎


Lemma 4.8 allows K to be chosen to obtain arbitrarily small non-negative
bias. Let the bias of the shared coin be 1/2 + ε; then

\[ \epsilon \le \frac{n - 1}{2K}, \]

which gives

\[ K \le \frac{n - 1}{2\epsilon}. \]

Combining this inequality with Lemma 4.7 gives a bound on the worst-case
expected running time for the protocol of O((n/ε)²) total counter operations.
This time is comparable to the worst-case expected running times of
the protocol's non-robust ancestors. The protocol thus achieves robustness
without paying a significant cost in speed.
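
As a worked instance of this trade-off, the sketch below picks the smallest integer K for which (n − 1)/(2K) is at most a target ε, subject to the requirement K > n of Theorem 4.9, and reports the dominant term (K + n)² from Lemma 4.7. The helper name and the use of exact fractions are illustrative assumptions; constants hidden by the O-notation are ignored.

```python
from fractions import Fraction
from math import ceil

def choose_K(n, eps):
    """Smallest integer K > n whose bias bound (n - 1) / (2K) is at most eps.

    Returns (K, resulting bias bound, (K + n)**2).
    """
    eps = Fraction(eps)
    K = max(n + 1, ceil(Fraction(n - 1) / (2 * eps)))
    return K, Fraction(n - 1, 2 * K), (K + n) ** 2
```

With n = 11 and ε = 1/20 this selects K = 100, so the walk length term is 111², which illustrates the quadratic growth in n/ε.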

4.3 Implementing a bounded counter with atomic registers

The robust shared coin protocol assumes the presence of a shared counter
supporting atomic increment, decrement, and read operations, with the
restriction that no operation will be applied that will move the counter out of
some fixed range [−r, r]. In practice such a counter is not likely to be
available as a hardware primitive. Fortunately it is not difficult to implement a
shared counter using atomic registers. However, some care must be taken to
guarantee that the counter uses only a bounded amount of space.

Both Aspnes and Herlihy [AH90a] and Attiya, Dolev, and Shavit [ADS89]
describe shared counter implementations. The two counter implementations
both assign a register to hold the net increment due to each processor, so
that the counter's value is simply the sum of the values in these registers.
Both algorithms use simple atomic snapshot protocols to allow the entire set
of registers to be read in a single atomic action.

Alas, neither implementation does quite what we would like. Even though
the value stored in the counter will never leave the range [−r, r], the net
increment due to an individual processor is potentially unbounded. The
Aspnes-Herlihy protocol ignores this difficulty by assuming the presence of
unbounded registers (which it also uses to implement the atomic scan). The
Attiya-Dolev-Shavit protocol uses only bounded registers, but enforces the
bounds by prematurely terminating the shared coin protocol if any processor's
register wanders out of a limited range. This premature termination
occurs infrequently, and is acceptable in a shared coin that does not need to
guarantee consistency. But it is not acceptable for a robust coin, as it may
allow the scheduler to force some processor to choose one value (through
premature termination) after another has already chosen a different value
(through the normal workings of the shared coin protocol).

A simple alternative to premature termination that still allows the size
of the registers to be bounded is to store the remainder of each processor's
contribution relative to some convenient modulus m greater than the total
range 2r + 1. The counter value can then be reconstructed as the unique v
in the range [−r, r] that is congruent to the sum of the registers, modulo m.
Pseudocode for the three counter operations using this technique is shown
in Figure 4.4; it assumes the presence of an array of registers which can be
read in a single operation.

Shared data:
    scannable array count[0 ... n − 1], initialized to 0

procedure increment()
    v ← count[me]
    count[me] ← (v + 1) mod m

procedure decrement()
    v ← count[me]
    count[me] ← (v − 1) mod m

procedure read()
    scan count
    v ← sum of count[i] for i = 0, ..., n − 1
    return v′ where −r ≤ v′ ≤ r and v′ ≡ v (mod m)

Figure 4.4: Pseudocode for counter operations.

Such an array can be simulated using an atomic snapshot protocol
[AAD+90, AH90b, And90]. An atomic snapshot is an operation that returns a
picture of the values in all of the registers in the array that is consistent
with other pictures returned by other snapshots and with the order of
non-overlapping write operations, even though it may not correspond to the
actual values in the registers at a particular point in time. Typically, it is
necessary to make writes to registers in the scannable array be more than
just simple writes to individual registers, so taking a snapshot of the array
and writing to an element of the array will both be expensive. However, the
algorithm of Afek et al. [AAD+90] allows an atomic scan operation to be
implemented with O(n²) bits of extra space and a maximum of O(n²)
primitive read and write operations for each snapshot and each write to a
simulated register in the scannable array. Using their algorithm, Figure 4.4
implements an atomic counter where each counter operation costs O(n²)
register operations.
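
The modular technique of Figure 4.4 can be sketched in a few lines. The version below is a sequential illustration only: a plain list copy stands in for the atomic scan, the class and method names are hypothetical, and the modulus is fixed at m = 2r + 2, which is one admissible choice of a modulus greater than 2r + 1.

```python
class ModularCounter:
    """Bounded counter in the style of Figure 4.4 (sequential sketch)."""

    def __init__(self, n, r):
        self.r = r
        self.m = 2 * r + 2           # any modulus greater than 2r + 1 would do
        self.count = [0] * n         # one bounded register per processor

    def increment(self, me):
        self.count[me] = (self.count[me] + 1) % self.m

    def decrement(self, me):
        self.count[me] = (self.count[me] - 1) % self.m

    def read(self):
        snapshot = list(self.count)  # stands in for the atomic scan
        v = sum(snapshot) % self.m
        # Map the residue back to its unique representative in [-r, r].
        return v if v <= self.r else v - self.m
```

Each register stays in [0, m) no matter how many operations a single processor performs, while read still recovers the true counter value as long as that value remains within [−r, r].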

4.4 The randomized consensus protocol

Figure  4.5  shows  pseudocode  for  each  processor’s  behavior  in  the  randomized 
consensus  protocol.  The  protocol  uses  three  shared  counters.  The  first  two 
maintain  a  total  of  the  number  of  participating  processors  that  started  with 
each  of  the  inputs  0  and  1.  The  last  is  used  as  the  counter  for  a  modified 
version  of  the  robust  shared  coin  protocol.  All  of  the  counters  have  an  initial 
value  of  0. 

The protocol is optimized for the case where few processors participate.
We will define a processor to be active if it takes at least one step before
some processor decides on a value, and denote by p the total number of active
processors in a given execution. The protocol uses the counters a₀ and a₁
to keep track of the number of active processors by having each processor
increment one or the other of these counters as it starts the protocol.

The  protocol  depends  on  being  able  to  take  an  atomic  snapshot  of  the 
counters.  Since  the  first  two  counters  are  never  decremented,  such  a  snapshot 
can  be  obtained  as  described  in  Figure  4.6.  Though  the  operation  defined 
there is not wait-free, because it will not finish if a₀ or a₁ changes during
some  pass  through  the  loop,  this  event  can  occur  at  most  p  times  during  any 
execution  of  the  consensus  protocol.  So  in  fact  the  time  to  carry  out  the 
atomic  snapshot  will  be  bounded  in  the  context  in  which  it  is  used. 

If  the  counters  are  not  primitives  but  are  instead  constructed  as  described 
in  Section  4.3  using  an  atomic  scan  operation,  the  overhead  of  Figure  4.6  can 
be  avoided  completely  by  simply  reading  all  three  counters  in  a  single  atomic 
scan  of  the  arrays  that  implement  them. 

Several features of the protocol are worth noting. First of all, the same
"slopes" that ensured consistency for the robust shared coin ensure
consistency for the consensus protocol, for the same reasons. Second, the counters
a₀ and a₁ allow the protocol to guarantee validity, as the random walk is only
invoked if both have non-zero values. These counters are also used to
minimize the range of the random walk, by taking advantage of the fact stated
in the following lemma, a modification of Lemma 4.6:

Lemma 4.10 Let a₀, a₁, c be the values read from the counters by some
processor and t the true position of the random walk in the state preceding the
read. Then |c − t| ≤ a₀ + a₁ − 1.

Proof: There are at most a₀ + a₁ − 1 processors with pending increments or
decrements. ∎

Shared data:
    counter a₀ with range [0, n], initialized to 0
    counter a₁ with range [0, n], initialized to 0
    counter c with range [−4n, 4n], initialized to 0

procedure consensus(input)
    increment a_input
    repeat
        read a₀, a₁, c
        if c ≤ −2n then output 0
        else if c ≥ 2n then output 1
        else if c ≤ −(a₀ + a₁) or a₁ = 0 then decrement c
        else if c ≥ (a₀ + a₁) or a₀ = 0 then increment c
        else
            if local_flip() = 1 then increment c
            else decrement c

Figure 4.5: Consensus protocol.

procedure scan_counters()
    repeat
        a₀ ← read(a₀)
        a₁ ← read(a₁)
        c ← read(c)
        a₀′ ← read(a₀)
        a₁′ ← read(a₁)
    until a₀′ = a₀ and a₁′ = a₁
    return a₀, a₁, c

Figure 4.6: Counter scan for randomized consensus protocol.
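
The consensus protocol can be exercised with a small simulation. The sketch below is an illustration under stated assumptions, not the protocol's definitive form: the decision thresholds ±2n and the slope boundaries ±(a₀ + a₁) are taken to be inclusive, the read of a₀, a₁ and c is treated as a single atomic step, and the scheduler is uniformly random rather than adversarial. The function and variable names are hypothetical.

```python
import random

def consensus(inputs, seed=0):
    """Simulate the counter-based consensus protocol; returns (outputs, max |c|)."""
    n = len(inputs)
    rng = random.Random(seed)
    a = [0, 0]                 # a[0], a[1]: processors that started with 0 / 1
    c = 0                      # random-walk counter
    started = [False] * n
    pending = [None] * n       # operation on c chosen at the last read
    out = [None] * n
    extreme = 0
    while any(o is None for o in out):
        p = rng.choice([i for i in range(n) if out[i] is None])
        if not started[p]:
            a[inputs[p]] += 1              # announce own input
            started[p] = True
        elif pending[p] is not None:
            c += pending[p]                # delayed operation takes effect
            pending[p] = None
            extreme = max(extreme, abs(c))
        else:
            a0, a1, v = a[0], a[1], c      # atomic snapshot of the counters
            if v <= -2 * n:
                out[p] = 0
            elif v >= 2 * n:
                out[p] = 1
            elif v <= -(a0 + a1) or a1 == 0:
                pending[p] = -1
            elif v >= (a0 + a1) or a0 == 0:
                pending[p] = +1
            else:
                pending[p] = rng.choice((+1, -1))   # local coin flip
    return out, extreme
```

With unanimous inputs the random walk is never invoked and every processor outputs the common input; with mixed inputs all outputs agree, and |c| stays within 4n.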

To  prove  that  the  consensus  protocol  is  correct,  we  must  establish  that  it 
is  consistent,  that  it  terminates,  and  that  it  is  valid.  The  proof  of  consistency 
is  a  straightforward  modification  of  the  proof  of  Lemma  4.4: 

Lemma 4.11 If any processor reads a counter value v ≥ 2n, then all
subsequent reads will return values ≥ n + 1; in the symmetric case where
v ≤ −2n, all subsequent reads will return values ≤ −(n + 1).

Proof: Apply the proof of Lemma 4.4 with K = n. ∎

Similarly, the proof that the counter c does not overflow is a
straightforward modification of Lemma 4.5:

Lemma 4.12 The value of c never leaves the range [−4n, 4n] in any
execution of the consensus protocol.

Proof: Apply the proof of Lemma 4.5 with K = n. ∎

Termination  is  trickier  to  demonstrate.  As  in  the  case  of  the  shared  coin, 
the  key  to  proving  the  consensus  protocol’s  termination  is  the  fact  that  the 
scheduler’s  only  alternative  to  moving  the  true  position  randomly  is  to  move 
it  away  from  the  origin.  In  the  shared  coin  protocol,  this  condition  depends 
on fixing the parameter K > n. In the consensus protocol the situation
is  more  complicated,  as  the  protocol  uses  its  knowledge  of  the  number  of 
currently  active  processors  to  set  the  inner  boundaries  of  the  slope  close  to 
the  origin  while  still  preventing  the  scheduler  from  being  able  to  force  the 
true  position  to  move  toward  the  origin. 

Lemma  4.13  Let  n  be  the  total  number  of  processors  and  p  be  the  number 
of  processors  that  take  at  least  one  step  before  some  processor  decides  on  a 
value.  Then  the  worst-case  expected  running  time  of  the  consensus  protocol 
is O(p² + n) total counter operations.

Proof: We will show that the consensus protocol terminates in O(p² + n)
time by reducing it to a controlled random walk of the true position t. Divide
the execution of the protocol into two phases. In the first phase, at most one
of a₀, a₁ is nonzero; if the execution does not leave the first phase before
2n increments or decrements have occurred, the protocol will terminate after
O(n) additional steps by Lemma 4.11.

In the second phase, both a₀ and a₁ are nonzero. Let v be a value read
by some processor from the counter c. By Lemma 4.10 we know that
|t − v| ≤ a₀ + a₁ − 1 ≤ p − 1. Now, to force an increment during the second
phase the scheduler must show a processor a counter value v that is at least
a₀ + a₁, possibly by withholding local coin-flips to raise the value of c or
by withholding increments to lower the value of a₀ + a₁. In either case
Lemma 4.10 applies and t must be greater than 0. The case of the scheduler
attempting to force a decrement is symmetric, and thus in either case the
scheduler can only force the true position to move away from 0.

Furthermore, since p is an upper bound both on the distance between c
and t and on the value of a₀ + a₁, if |t| > 2p then |v| > a₀ + a₁ and the
true position will move away from 0 thereafter. Thus the second phase of the
execution can be modeled as a controlled random walk in the sense of Lemma
4.3 with center 0, barriers at ±2p, and a starting position equal to the true
position at the end of the first phase. By Lemma 4.3, this random walk will
take an expected O(p²) steps, each consisting of a constant number of counter
operations; to this value must be added O(n) steps until termination, up to
O(n) steps from the first phase, and O(p²) extra read operations due to extra
passes through the loop in scan_counters(). The total expected number of
counter operations is thus O(p² + n). ∎

Note that the expected running time of O(p² + n) is expressed in total
counter operations. If the counter is implemented as described in Section 4.3
the total number of register operations will be O(n²(p² + n)).

Lemma 4.14 The protocol of Figure 4.5 satisfies the validity condition.

Proof: Suppose every processor starts with the input 1. Then a₀ is never
incremented and so retains its initial value of 0 throughout the execution
of the protocol. Thus each processor will increment c until it reads a value
v ≥ 2n, at which point it will decide 1. The case where every processor has
input 0 is symmetric. ∎

Combining the lemmas gives:

Theorem 4.15 Figure 4.5 implements a consensus protocol.

Proof: Lemmas 4.11, 4.13, and 4.14. ∎

It is worth looking at the behavior of the shared coin implicitly embedded
in the consensus protocol of Figure 4.5. Because the function of detecting
agreement is implemented in the shared coin itself, limiting scheduler control
over the outcome of the shared coin is no longer necessary to achieve
consensus. Thus the parameter K of the shared coin protocol can be set to
minimize the time taken in the random walk without regard to its effect on
the agreement parameter δ. In the protocol of Figure 4.5 the shared coin has
an  effective  agreement  parameter  of  as  low  as  is  possible  without  setting 

K  <  p. 

At the same time, the simplicity of the protocol allows the number and
size of the shared counters to be very small. Unfortunately, when the
available primitives are limited to atomic registers this small size is lost in the
Θ(n²) space overhead of the atomic scan operation. It is not immediately
clear that this overhead is a necessary feature of an atomic counter
implementation; much work remains to be done in this area.


Chapter 5
Consensus Using Weighted Voting

5.1 Introduction

In the previous chapter we built a consensus protocol that directly
incorporated a robust shared coin. Here we will show how to construct a faster but
non-robust shared coin which gives consensus using standard constructions
such as the one of [AH90a] described in Section 3.3.

This shared coin protocol requires a departure from previous practice. As
in the protocols of the previous chapter, the fundamental technique behind all
shared coin protocols since [AH90a] has been the use of repeated,
equally-weighted votes to reduce the impact of any particular processor's private
knowledge and with it the adversary's ability to affect the outcome of the
coin. There are many advantages to this approach. The processors act as
anonymous conduits of a stream of unpredictable random increments. If
the scheduler stops a particular processor, at worst all it does is keep one
vote from being written out to the common pool: the next local coin flip
executed by some other processor is no more or less likely to give the value
the scheduler wants than the next one executed by the processor it has just
stopped. Intuitively, the scheduler's power over the outcome of the shared
coin is limited to filtering out up to n − 1 local coin flips from this stream
of independent random variables. But the effect of this filtering is at worst
equivalent to adjusting the final tally of votes by up to n − 1. If a constant
multiple of n² votes are cast, the total variance will be Ω(n²). Because the
total vote is approximately normally distributed, the protocol can guarantee
that with constant probability the total vote is more than n away from the
origin, rendering the scheduler's adjustment ineffective.

Alas, the very anonymity of the processors that is the strength of the
voting technique is also its greatest weakness. To overcome the scheduler's
power to withhold votes, it is necessary that a total of Ω(n²) votes are cast,
but the scheduler might also choose to stop all but one of the processors,
leaving that lone processor to generate all Ω(n²) votes by itself. It follows
that, for all of the polynomial-time wait-free consensus protocols based on
unweighted voting, the worst-case expected bound on the work done by a
single processor is asymptotically no better than the bound on the total
work done by all of the processors together.

In this chapter we show how to avoid this problem by modifying a
protocol of Bracha and Rachman [BR91] to allow the processors to cast votes of
increasing weight. Thus a fast processor or a processor running in isolation
can quickly generate votes of sufficient total variance to finish the protocol,
at the cost of giving the scheduler greater control by allowing it both to
withhold votes with larger impact and to choose among up to n different weights
(one for each processor) when determining the weight of the next vote.

There  are  two  main  difficulties  that  this  approach  entails.  The  first  is  that 
careful adjustment of the weight function and other parameters of the protocol
is necessary to make sure that it performs correctly. More importantly,
allowing  the  weight  of  the  i-th  vote  to  depend  on  the  particular  processor 
the  scheduler  chooses  to  run,  which  may  in  turn  depend  on  the  outcomes 
of  previous  votes,  means  that  we  cannot  treat  the  sequence  of  votes  as  a 
sequence  of  independent  random  variables. 

However, the sign of each vote is determined by a fair coin flip that
the scheduler cannot predict in advance, and so despite all the scheduler's
powers, the expected value of each vote before it is cast is always 0. This
is the primary requirement of a martingale process [Bil86, Fel71, Kop84].
Under the right conditions, martingales have many similarities to sequences
of sums of independent random variables. In particular, martingale analogues
of the Central Limit Theorem and Chernoff bounds will be used in the proof
of correctness.

The rest of the chapter is organized as follows. Section 5.2 defines the
shared coin protocol and gives an overview of its operation. Section 5.3
contains a brief definition of martingales and describes some of their
properties. Finally, Section 5.4 proves the correctness of the protocol for two sets
of parameters, one of which allows it to simulate the equally-weighted voting
protocol of [BR91], and one which gives a bound of O(n log² n) on the
expected number of operations executed by a single processor.

 1  procedure shared_coin()
 2  begin
 3      my_reg(variance, vote) ← (0, 0)
 4      t ← 1
 5      repeat
 6          for i = 1 to c do
 7              vote ← local_flip() × w(t)
 8              my_reg ← (my_reg.variance + w(t)², my_reg.vote + vote)
 9              t ← t + 1
10          end
11          read all the registers, summing the variance fields into the
            local variable total_variance
12      until total_variance > K
13      read all the registers, summing the vote fields into the local
        variable total_vote
14      if total_vote > 0
15          then output 1
16      else if total_vote < 0
17          then output 0
18      else fail
19  end

Figure 5.1: Shared coin protocol.

5.2  The  shared  coin  protocol 

Figure 5.1 gives pseudocode for each processor's behavior during the
shared coin protocol. Each processor repeatedly flips a local coin that
returns the values +1 and −1 with equal probability. The weighted value of
each flip is w(t) or −w(t) respectively, where t is the number of coins flipped
by the processor up to and including its current flip. Each weighted flip
represents a vote for either the output value 1 (if positive) or 0 (if negative).
After each flip, the processor updates its register to hold the sum of the
weighted flips it has performed, and the sum of the squares of their values.
After every c flips, the processor reads the registers of all the other
processors, and computes the sum of all the weighted flips (the total vote) and the
sum of the squares of their values (the total variance). If the total variance
is greater than the quorum K, it stops, and outputs 1 if the total vote is
positive, and 0 if it is negative (it treats a total vote of zero as a failure to
avoid introducing asymmetry between the two outcomes). Alternatively, if
the total variance has not yet reached the quorum K, it continues to flip its
local coin.

As in the previous chapter, the function local_flip returns the values 1 and
−1 randomly with equal probability. The values K and c are parameters of
the protocol which will be set depending on the number of processors n to
give the desired bounds on the agreement parameter and running time. The
weight function w(t) is used to make later local coin flips have more effect
than earlier ones, so that a processor running in isolation will be able to
achieve the quorum K quickly. The weight function will be assumed to be
of the form w(t) = t^a, where a is a nonnegative parameter depending on n;
though other weight functions are possible, this choice simplifies the analysis.
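
The effect of the weight function can be seen in a short simulation of the procedure in Figure 5.1. The sketch below makes several simplifying assumptions: each flip-and-write is one atomic step, the two collective reads (lines 11 and 13) are performed together and atomically, the scheduler is uniformly random, and a failure is reported as None. The function name and its return format are hypothetical.

```python
import random

def shared_coin(n, K, c, a, seed=0):
    """Simulate the weighted-voting coin with w(t) = t**a.

    Returns (outputs by processor, number of flips made by each processor).
    """
    rng = random.Random(seed)
    regs = [(0, 0)] * n        # (variance, vote) register of each processor
    t = [1] * n                # index of each processor's next flip
    in_batch = [0] * n         # flips made since the last collective read
    out = {}
    while len(out) < n:
        p = rng.choice([i for i in range(n) if i not in out])
        if in_batch[p] < c:
            w = t[p] ** a
            vote = rng.choice((+1, -1)) * w
            variance, total = regs[p]
            regs[p] = (variance + w * w, total + vote)
            t[p] += 1
            in_batch[p] += 1
            continue
        in_batch[p] = 0
        if sum(v for v, _ in regs) > K:          # quorum reached
            total_vote = sum(x for _, x in regs)
            out[p] = 1 if total_vote > 0 else 0 if total_vote < 0 else None
    return out, [ti - 1 for ti in t]
```

Running a single processor in isolation with K = 1000 and c = 1 needs 1001 unit-weight flips when a = 0, but only 14 flips when a = 1, since 1² + 2² + ... + 14² = 1015 is the first partial sum of squares to exceed 1000.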

We will demonstrate that for suitable choice of K, c and a all processors
return 1 with constant probability; the case of all processors returning 0
will follow by symmetry. The structure of the argument follows the proof of
correctness of the less sophisticated protocol of Bracha and Rachman [BR91],
which corresponds to Figure 5.1 when w(t) is the constant 1, K = Θ(n²),
and c = O(n/log n). Votes cast before the quorum K is reached will form
a pool of common votes that all processors see.¹ We will show that with
constant probability (i) the total of the common votes is far from the origin
and (ii) the sum of the extra votes cast between the time the quorum is
reached and the time some processor does its final read in line 13 is small,
so that the total vote read by each processor will have the same sign as the
total common vote.

¹The definitions of the common and extra votes we will use differ slightly from those
used in [BR91]; the formal definitions appear in Section 5.4.


This simple overview of the proof hides many tricky details. To simplify
the analysis we will concentrate not on the votes actually written to the
registers but on the votes whose values have been decided by the processors'
execution of the local coin flip in line 7; conversion back to the values actually
in the registers will be done by showing a bound on the difference between
the total decided vote and the total of the register values. In effect, we are
treating a vote as having been "cast" the moment that its value is determined,
instead of when it becomes visible to the other processors.

Some care is also needed to correctly model the sequence of votes. Most
importantly, as pointed out above, allowing the weight of the i-th vote to
depend  on  which  processor  the  scheduler  chooses  to  run  means  the  votes  are 
not  independent.  So  the  straightforward  proof  techniques  used  for  protocols 
based  on  a  stream  of  identically-distributed  random  votes  no  longer  apply, 
and  it  is  necessary  to  bring  in  the  theory  of  martingales  to  describe  the 
execution  of  the  protocol. 

5.3  Martingales 

A martingale is a sequence of random variables $S_1, S_2, \ldots$, which informally
may be thought of as representing the changes in the fortune of a gambler
playing in a fair casino. Because the gambler can choose how much to bet or
which game to play at each instant, each random variable $S_i$ may depend on
all previous events. But because the casino is fair and the gambler cannot
predict the future, the expected change in the gambler's fortune at any play
is always 0.

We will need to use a very general definition of a martingale [Bil86, Fel71,
Kop84]. The simplest definition of a martingale says that the expected value
of $S_{i+1}$ given $S_1, S_2, \ldots, S_i$ is just $S_i$. To use a gambling analogy, this
definition says that a gambler who knows only the previous values of her fortune
cannot predict its expected future value any better than by simply using its
current value. But what if the gambler knows more information than just
the changing size of her bankroll? For example, imagine that she is placing
bets on a fair version of roulette, and always bets on either red or black.
Knowing that her fortune increased after betting red will tell her only that
one of eighteen red numbers came up; but a real gambler will see precisely
which of the eighteen numbers it was. Still, we would like to claim that this
additional knowledge does not affect her ability to predict the future. To
do so, the definition of a martingale must be extended to allow additional
information to be represented explicitly.

The tool used to represent the information known at any point in time
will be a concept from measure theory, a σ-algebra.² The description given
here is informal; more complete definitions can be found in [Fel71, Sections
IV.3, IV.4, and V.11] or [Bil86].

5.3.1 Knowledge, σ-algebras, and measurability

Recall  that  any  probabilistic  statement  is  always  made  in  the  context  of  some 
(possibly  implicit)  sample  space.  The  elements  of  the  sample  space  (called 
sample  points)  represent  all  possible  results  of  some  set  of  experiments, 
such  as  flipping  a  sequence  of  coins  or  choosing  a  point  at  random  from  the 
unit  interval.  Intuitively,  all  randomness  is  reduced  to  selecting  a  single  point 
from  the  sample  space.  An  event,  such  as  a  particular  coin-flip  coming  up 
heads  or  a  random  variable  taking  on  the  value  0,  is  simply  a  subset  of  the 
sample  space  that  “occurs”  if  one  of  the  sample  points  it  contains  is  selected. 

If  we  are  omniscient,  we  can  see  which  sample  point  is  chosen  and  thus 
can  tell  for  each  event  whether  it  occurs  or  not.  However,  if  we  have  only 
partial  information,  we  will  not  be  able  to  determine  whether  some  events 
occurred  or  not.  We  can  represent  the  extent  of  our  knowledge  by  making 
a  list  of  all  events  we  do  know  about.  This  list  will  have  to  satisfy  certain 
closure  properties;  for  example,  if  we  know  whether  or  not  A  occurred,  and 
whether  or  not  B  occurred,  then  we  should  know  whether  or  not  the  event 
“A or B” occurred.

We will require that the set of known events be a σ-algebra. A σ-algebra
$\mathcal{F}$ is a family of subsets of a sample space $\Omega$ that (i) contains the empty set;
(ii) is closed under complement: if $\mathcal{F}$ contains $A$, it contains $\Omega \setminus A$ (the
complement of $A$); and (iii) is closed under countable union: if $\mathcal{F}$ contains all of
$A_1, A_2, \ldots$, it contains $\bigcup_i A_i$.³ An event $A$ is said to be $\mathcal{F}$-measurable if it
is contained in $\mathcal{F}$. In our context, the term “measurable,” which comes from
the original measure-theoretic use of σ-algebras to represent families of sets
on which a probability distribution is well-defined, simply means “known.”

² Sometimes called a σ-field.

³ Additional properties, such as being closed under finite union or intersection, follow
immediately from this definition.


We “know” about an event if we can determine whether or not it occurred.
What about random variables? A random variable $X$ is defined to be
$\mathcal{F}$-measurable if every event of the form $X \le c$ is $\mathcal{F}$-measurable. (The closure
properties of $\mathcal{F}$ then imply that such events as $a < X < b$, $X = d$, and
so forth are also $\mathcal{F}$-measurable.) Looking at the situation in reverse, given
random variables $X_1, X_2, \ldots$ we can consider the minimum σ-algebra $\mathcal{F}$ for
which each of the random variables is $\mathcal{F}$-measurable; this σ-algebra, written
$\langle X_i \rangle$, is called the σ-algebra generated by $X_1, X_2, \ldots$, and represents all
information that can be inferred from knowing the values of the generators.

A σ-algebra gives us a rigorous way to define “knowledge” in a probabilistic
context. Measurability and generated σ-algebras give us a way to move
back and forth between the abstract concept of a σ-algebra and concrete
statements about which random variables are completely known. To analyze
random variables that are only partially known, we need one more definition.
We need to extend conditional expectations so that the condition can be a
σ-algebra rather than just a collection of random variables.

For each event $A$ let $I_A$ be the indicator variable that is 1 if $A$ occurs
and 0 otherwise. Let $U = E[X \mid \mathcal{F}]$ be a random variable such that (i) $U$
is $\mathcal{F}$-measurable and (ii) $E[U I_A] = E[X I_A]$ for all $A$ in $\mathcal{F}$. The random
variable $E[X \mid \mathcal{F}]$ is called the conditional expectation of $X$ with respect
to $\mathcal{F}$ [Fel71, Section V.11]. Intuitively, the first condition on $E[X \mid \mathcal{F}]$ says
that it reveals no information not already found in $\mathcal{F}$. The second condition
says that just knowing that some event in $\mathcal{F}$ occurred does not allow one
to distinguish between $X$ and $E[X \mid \mathcal{F}]$; this fact ultimately implies that
$E[X \mid \mathcal{F}]$ uses all information that is found in $\mathcal{F}$ and is relevant to $X$.

If $\mathcal{F}$ is generated by random variables $X_1, X_2, \ldots$, the conditional
expectation $E[X \mid \mathcal{F}]$ reduces to the simpler version $E[X \mid X_1, X_2, \ldots]$. Some other
facts about conditional expectation that we will use (but not prove): if $X$ is
$\mathcal{F}$-measurable, then $E[XY \mid \mathcal{F}] = X\,E[Y \mid \mathcal{F}]$ (which implies $E[X \mid \mathcal{F}] = X$);
and if $\mathcal{F} \subseteq \mathcal{F}'$, then $E[E[X \mid \mathcal{F}'] \mid \mathcal{F}] = E[X \mid \mathcal{F}]$. See [Fel71, Section V.11].
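The two facts about conditional expectation quoted above can be checked by hand on a finite sample space, where a σ-algebra is determined by a partition into atoms and the conditional expectation is the average over each atom. The following is a minimal sketch under that assumption; the sample space (three fair coin flips) and the helper names `partition_by` and `cond_exp` are illustrative and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: three fair coin flips; every sample point has probability 1/8.
OMEGA = list(product([0, 1], repeat=3))
PROB = {w: Fraction(1, 8) for w in OMEGA}

def partition_by(*coords):
    """Atoms of the sigma-algebra generated by the listed flips (illustrative helper)."""
    blocks = {}
    for w in OMEGA:
        blocks.setdefault(tuple(w[c] for c in coords), []).append(w)
    return list(blocks.values())

def cond_exp(X, partition):
    """E[X | F] as a function on OMEGA: the average of X over each atom of F."""
    result = {}
    for block in partition:
        mass = sum(PROB[w] for w in block)
        avg = sum(PROB[w] * X[w] for w in block) / mass
        for w in block:
            result[w] = avg
    return result

X = {w: Fraction(sum(w)) for w in OMEGA}   # number of heads
F1 = partition_by(0)                       # knowledge of the first flip only
F2 = partition_by(0, 1)                    # knowledge of the first two flips

direct = cond_exp(X, F1)
tower = cond_exp(cond_exp(X, F2), F1)      # E[E[X | F2] | F1]

Z = {w: Fraction(w[0]) for w in OMEGA}     # an F1-measurable variable
ZX = {w: Z[w] * X[w] for w in OMEGA}
pulled_out = {w: Z[w] * direct[w] for w in OMEGA}
```

Here `tower` agrees with `direct` (conditioning on the finer σ-algebra first changes nothing), and `cond_exp(ZX, F1)` agrees with `pulled_out` (a known factor can be moved outside the expectation).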

5.3.2  Definition  of  a  martingale 

We now have the tools to define a martingale when the information available
at each point in time is not limited to just the values of earlier random
variables. A martingale $\{S_i, \mathcal{F}_i\}$, $1 \le i \le n$, is a stochastic process where
each $S_i$ is a random variable representing the state of the process at time $i$ and
$\mathcal{F}_i$ is a σ-algebra representing the knowledge of the underlying probability
distribution available at time $i$. Martingales are required to satisfy three
axioms, for all $i$:

1. $\mathcal{F}_i \subseteq \mathcal{F}_{i+1}$. (The past is never forgotten.)

2. $S_i$ is $\mathcal{F}_i$-measurable. (The present is always known.)

3. $E[S_{i+1} \mid \mathcal{F}_i] = S_i$. (The future cannot be foreseen.)

Often $\mathcal{F}_i$ will simply be the σ-algebra $\langle S_1, \ldots, S_i \rangle$ generated by the
variables $S_1$ through $S_i$; in this case axioms 1 and 2 will hold automatically.

To avoid special cases let $\mathcal{F}_0$ denote the trivial σ-algebra consisting of the
empty set and the entire probability space. The difference sequence of a
martingale is the sequence $X_1, X_2, \ldots, X_n$ where $X_1 = S_1$ and $X_i = S_i - S_{i-1}$
for $i > 1$. A zero-mean martingale is a martingale for which $E[S_i] = 0$.

5.3.3  Gambling  systems 

A remarkably useful theorem, which has its origins in the study of gambling
systems, is due to Halmos [Hal39]. We restate his theorem below in modern
notation:

Theorem 5.1 Let $\{S_i, \mathcal{F}_i\}$, $1 \le i \le n$ be a martingale with difference
sequence $\{X_i\}$. Let $\{\zeta_i\}$, $1 \le i \le n$ be random variables taking on the values
0 and 1 such that each $\zeta_i$ is $\mathcal{F}_{i-1}$-measurable. Then the sequence of random
variables $S'_i = \sum_{j=1}^{i} \zeta_j X_j$ is a martingale, relative to $\mathcal{F}_i$.

Proof: The first two properties are easily verified. Because $\zeta_i$ is
$\mathcal{F}_{i-1}$-measurable, $E[\zeta_i X_i \mid \mathcal{F}_{i-1}] = \zeta_i E[X_i \mid \mathcal{F}_{i-1}] = 0$, and the third property also
follows. ∎
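Theorem 5.1 can be verified exhaustively for a small fair game. The sketch below is illustrative only: it assumes independent fair ±1 bets and one particular selection rule (bet only while the underlying fortune is not positive), neither of which appears in the text. It enumerates every outcome and checks the martingale property of the selected sum exactly. For contrast it also computes the mean of a "cheating" rule that looks at the current bet, which is not measurable with respect to the past and is not covered by the theorem.

```python
from fractions import Fraction
from itertools import product

N = 6  # number of fair +/-1 bets (illustrative)

def selected_sum(flips, upto):
    """S'_upto = sum of zeta_j * X_j for j <= upto; zeta_j uses X_1..X_{j-1} only."""
    total = 0      # selected sum S'
    fortune = 0    # underlying martingale S
    for x in flips[:upto]:
        zeta = 1 if fortune <= 0 else 0   # decided before x is revealed
        total += zeta * x
        fortune += x
    return total

def cond_mean_next(prefix):
    """E[S'_{i+1} | X_1..X_i = prefix], averaging over the next fair flip."""
    i = len(prefix)
    vals = [selected_sum(prefix + (x,), i + 1) for x in (-1, 1)]
    return Fraction(sum(vals), 2)

# Axiom 3 for S': the conditional mean of the next value equals the current value.
martingale_ok = all(
    cond_mean_next(prefix) == selected_sum(prefix, len(prefix))
    for i in range(N)
    for prefix in product((-1, 1), repeat=i)
)

all_flips = list(product((-1, 1), repeat=N))
expected_final = Fraction(sum(selected_sum(f, N) for f in all_flips), len(all_flips))

# A selector that peeks at the current bet (keeps only the wins) has positive mean.
cheating_mean = Fraction(sum(sum(x for x in f if x > 0) for f in all_flips), len(all_flips))
```

The honest selector leaves the expected final value at 0, while the peeking selector gains N/2 on average; the measurability requirement on each selector is exactly what rules out the second case.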


5.3.4  Limit  theorems 

Many  results  that  hold  for  sums  of  independent  random  variables  carry  over 
in  modified  form  to  martingales.  For  example,  the  following  theorem  of 
Hall and Heyde [HH80, Theorem 3.9] is a martingale version of the classical
Central Limit Theorem:

Theorem 5.2 ([HH80]) Let $\{S_i, \mathcal{F}_i\}$ be a zero-mean martingale. Let
$V_n^2 = \sum_{i=1}^{n} E[X_i^2 \mid \mathcal{F}_{i-1}]$, and let $0 < \delta \le 1$. Define
$L_n = \sum_{i=1}^{n} E[|X_i|^{2+2\delta}] + E[|V_n^2 - 1|^{1+\delta}]$. Then there exists a constant $C$
depending only on $\delta$ such that whenever $L_n \le 1$,

$$\left| \Pr[S_n \le x] - \Phi(x) \right| \le C L_n^{1/(3+2\delta)} \left[ \frac{1}{1 + |x|^{4(1+\delta)^2/(3+2\delta)}} \right], \tag{5.1}$$

where $\Phi$ is the standard unit normal distribution with mean 0 and variance 1.

If we are interested only in the tails of the distribution of $S_n$, we can
get a tighter bound using Azuma's inequality, a martingale analogue of the
standard Chernoff bound [Che52] for sums of independent random variables.
The usual form of this bound (see [AS92, Spe87]) assumes that the difference
variables $X_i$ satisfy $|X_i| \le 1$. This restriction is too severe for our purposes,
so below we prove a generalization of the inequality. In order to do so we
will need the following technical lemma.

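Lemma 5.3 Let $\{S_i, \mathcal{F}_i\}$, $1 \le i \le n$ be a zero-mean martingale with
difference sequence $\{X_i\}$. Let $\mathcal{F}_0 \subseteq \mathcal{F}_1$ be a (not necessarily trivial) σ-algebra
such that $E[S_1 \mid \mathcal{F}_0] = 0$. If there exists a sequence of random variables
$w_1, w_2, \ldots, w_n$, and a random variable $W$, such that

1. $W$ is $\mathcal{F}_0$-measurable,

2. Each $w_i$ is $\mathcal{F}_{i-1}$-measurable,

3. For all $i$, $|X_i| \le w_i$ with probability 1, and

4. $\sum_{i=1}^{n} w_i^2 \le W$ with probability 1,

then for any $\alpha > 0$,

$$E[e^{\alpha S_n} \mid \mathcal{F}_0] \le e^{\alpha^2 W/2}. \tag{5.2}$$

Proof: The proof is by induction on $n$. Using the convexity of $e^{\alpha x}$ and the
fact that $E[X_1 \mid \mathcal{F}_0] = 0$, we have

$$E[e^{\alpha X_1} \mid \mathcal{F}_0] \le \tfrac{1}{2}\left(e^{-\alpha w_1} + e^{\alpha w_1}\right) = \cosh \alpha w_1 \le e^{\alpha^2 w_1^2/2}.$$

If $n = 1$ we are done, since $w_1^2 \le W$. If $n$ is greater than 1, for each $i \le n - 1$
let $S'_i = S_{i+1} - X_1$ and $\mathcal{F}'_i = \mathcal{F}_{i+1}$. Then $\{S'_i, \mathcal{F}'_i\}$, $1 \le i \le n - 1$ satisfies
the conditions of the lemma with $\mathcal{F}'_0 = \mathcal{F}_1$, $w'_i = w_{i+1}$ and $W' = W - w_1^2$, so
by the induction hypothesis $E[e^{\alpha S'_{n-1}} \mid \mathcal{F}_1] \le e^{\alpha^2 (W - w_1^2)/2}$. But then, using
the fact that $E[X \mid \mathcal{F}] = E[E[X \mid \mathcal{F}'] \mid \mathcal{F}]$ when $\mathcal{F} \subseteq \mathcal{F}'$, we can compute:

$$\begin{aligned}
E[e^{\alpha S_n} \mid \mathcal{F}_0] &= E\left[ E[e^{\alpha X_1} e^{\alpha (S_n - X_1)} \mid \mathcal{F}_1] \mid \mathcal{F}_0 \right] \\
&= E\left[ e^{\alpha X_1} E[e^{\alpha S'_{n-1}} \mid \mathcal{F}_1] \mid \mathcal{F}_0 \right] \\
&\le E\left[ e^{\alpha X_1} e^{\alpha^2 (W - w_1^2)/2} \mid \mathcal{F}_0 \right] \\
&\le e^{\alpha^2 w_1^2/2}\, e^{\alpha^2 (W - w_1^2)/2} = e^{\alpha^2 W/2}.
\end{aligned}$$

The last step uses the fact that $W - w_1^2$ is $\mathcal{F}_0$-measurable, so that it can be
moved outside the expectation, together with the bound on $E[e^{\alpha X_1} \mid \mathcal{F}_0]$
obtained above. ∎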

Theorem 5.4 Let $\{S_i, \mathcal{F}_i\}$, $1 \le i \le n$ be a zero-mean martingale with
difference sequence $\{X_i\}$. If there exists a sequence of random variables
$w_1, w_2, \ldots, w_n$, and a constant $W$, such that

1. Each $w_i$ is $\mathcal{F}_{i-1}$-measurable,

2. For all $i$, $|X_i| \le w_i$ with probability 1, and

3. $\sum_{i=1}^{n} w_i^2 \le W$ with probability 1,

then for any $\lambda > 0$,

$$\Pr[S_n \ge \lambda] \le e^{-\lambda^2/2W}. \tag{5.3}$$

Proof: By Lemma 5.3, for any $\alpha > 0$, $E[e^{\alpha S_n}] \le e^{\alpha^2 W/2}$. Thus by Markov's
inequality

$$\Pr[S_n \ge \lambda] = \Pr\left[e^{\alpha S_n} \ge e^{\alpha \lambda}\right] \le e^{\alpha^2 W/2} e^{-\alpha \lambda}.$$

Setting $\alpha = \lambda/W$ gives (5.3). ∎

Symmetry immediately gives us:

Corollary 5.5 For any martingale $\{S_i, \mathcal{F}_i\}$ satisfying the premises of
Theorem 5.4, and any $\lambda > 0$,

$$\Pr[S_n \le -\lambda] \le e^{-\lambda^2/2W}. \tag{5.4}$$

Proof: Replace each $S_i$ by $-S_i$ and apply Theorem 5.4. ∎
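The tail bound of Theorem 5.4 can be compared against an exact computation in the simplest special case, where the weights are fixed in advance and the votes are independent (fixed weights are trivially measurable with respect to the past). The sketch below is illustrative: the weight list and thresholds are arbitrary choices, and the function names are not from the text.

```python
import math
from itertools import product

def exact_tail(weights, lam):
    """Pr[S_n >= lam] when X_i = +/- weights[i] with equal probability, independently."""
    n = len(weights)
    hits = sum(
        1
        for signs in product((-1, 1), repeat=n)
        if sum(s * w for s, w in zip(signs, weights)) >= lam
    )
    return hits / 2 ** n

def azuma_bound(weights, lam):
    """Right-hand side of (5.3) with W taken to be the sum of squared weights."""
    W = sum(w * w for w in weights)
    return math.exp(-lam * lam / (2 * W))

weights = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]   # illustrative, unequal weights
checks = [
    (lam, exact_tail(weights, lam), azuma_bound(weights, lam))
    for lam in (1, 5, 10, 15, 20)
]
bound_holds = all(tail <= bound for _, tail, bound in checks)
```

The point of the generalization is visible in the choice of `W`: the bound depends on the sum of the squared weights rather than on the number of votes, so a few heavy votes and many light votes are treated on the same footing.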

5.4  Proof  of  correctness 

For  this  section  we  will  fix  a  particular  scheduler.  We  may  assume  without 
loss  of  generality  that  the  scheduler  is  deterministic,  because  any  random 
inputs  the  scheduler  might  use  cannot  depend  on  the  history  of  an  execution 
and  therefore  may  also  be  fixed  in  advance. 

Consider the sequence of random variables $X_1, X_2, \ldots$ where $X_i$ represents
the i-th vote that is decided by some processor executing line 7, or 0 if fewer
than $i$ local coin flips occur. For each $i$ let $\mathcal{F}_i$ be $\langle X_1, \ldots, X_i \rangle$, the σ-algebra
generated by $X_1$ through $X_i$. Because the scheduler is deterministic, all of
the random events in the system preceding the i-th vote are captured in the
variables $X_1$ through $X_{i-1}$, and the σ-algebra $\mathcal{F}_{i-1}$ thus determines the entire
history of the system up to but not including the i-th vote. Furthermore,
since the scheduler's behavior depends only on the history of the system,
$\mathcal{F}_{i-1}$ in fact determines the scheduler's choice of which processor will cast
the i-th vote. Thus conditioned on $\mathcal{F}_{i-1}$, $X_i$ is just a random variable which
takes on the values $\pm w$ with equal probability for some weight $w$ determined
by the scheduler's choice of which processor to run. Hence $E[X_i \mid \mathcal{F}_{i-1}] = 0$,
and the sequence of partial sums $S_i = \sum_{j=1}^{i} X_j$ is a martingale relative to
$\{\mathcal{F}_i\}$.

We are not going to analyze $\{S_i, \mathcal{F}_i\}$ directly. Instead, it will be used as
a base on which other martingales will be built using Theorem 5.1.

Let $\kappa_i = 1$ if $\sum_{j=1}^{i} X_j^2 \le K$ and 0 otherwise. Votes for which $\kappa_i = 1$ will be
called common votes. For each processor $P$ let $\zeta_{P,i} = 1$ if the vote $X_i$ occurs
before $P$ reads, during its final read in line 13, the register of the processor
deciding $X_i$, and let $\zeta_{P,i} = 0$ otherwise. In effect, $\zeta_{P,i}$ is the indicator variable
for whether $P$ would see $X_i$ if it were written out immediately. Observe
that for a fixed scheduler the values of both $\kappa_i$ and $\zeta_{P,i}$ can be determined by
examining the history of the system up to but not including the time when the
vote $X_i$ is cast, and thus both $\kappa_i$ and $\zeta_{P,i}$ are $\mathcal{F}_{i-1}$-measurable. Consequently
the sequences $\{\sum_{j=1}^{i} \kappa_j X_j\}$ and $\{\sum_{j=1}^{i} \zeta_{P,j} X_j\}$ are martingales relative to
$\{\mathcal{F}_i\}$ by Theorem 5.1. Votes for which $\zeta_{P,i} = 1$ but $\kappa_i = 0$ will be referred to
as the extra votes for processor $P$. (Observe that $\zeta_{P,i} \ge \kappa_i$ since $P$ could
not have started its final read until the total variance was at least $K$.) The
sequence $\{\sum_{j=1}^{i} (\zeta_{P,j} - \kappa_j) X_j\}$ of the partial sums of these extra votes is a
difference of martingales and is thus also a martingale relative to $\{\mathcal{F}_i\}$.
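The role of the fixed scheduler can be illustrated with a toy model that is small enough to enumerate completely. The sketch below makes several assumptions that are not taken from the text: two processors, vote weights $t^a$ with $a = 1$, a made-up adaptive scheduler that picks the next processor from the running total, and the convention that a vote is common while the running sum of squared weights, including that vote, is at most $K$. It checks that the total common vote has mean exactly 0 even though the scheduler reacts to earlier coin flips.

```python
from fractions import Fraction
from itertools import product

A_EXP = 1     # weight exponent a: a processor's t-th vote has weight t**a (assumed)
K = 50        # variance quorum (illustrative)
N_PROC = 2
M = 12        # votes enumerated per execution; enough for the variance to pass K

def scheduler(total, counts):
    """Toy adaptive adversary: chooses the next processor from past history only."""
    if total > 0:
        return min(range(N_PROC), key=lambda p: counts[p])   # next vote is light
    return max(range(N_PROC), key=lambda p: counts[p])       # next vote is heavy

def run(signs):
    """Return (total common vote, common variance, squared weight of first non-common vote)."""
    counts = [0] * N_PROC
    total = 0           # sum of all decided votes so far
    variance = 0        # sum of squared weights so far
    common_total = 0
    common_var = 0
    first_extra_sq = None
    for s in signs:
        p = scheduler(total, counts)   # depends only on earlier coin flips
        counts[p] += 1
        w = counts[p] ** A_EXP
        variance += w * w
        if variance <= K:              # kappa_i = 1: the vote is common
            common_total += s * w
            common_var += w * w
        elif first_extra_sq is None:
            first_extra_sq = w * w
        total += s * w
    return common_total, common_var, first_extra_sq

results = [run(signs) for signs in product((-1, 1), repeat=M)]
mean_common = Fraction(sum(r[0] for r in results), len(results))
```

Because the weight and the common/non-common status of each vote are fixed before its sign is flipped, averaging over all sign sequences gives `mean_common == 0`, as Theorem 5.1 predicts. In every execution the common variance falls short of $K$ by less than the squared weight of the first non-common vote.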

The structure of the proof of correctness is as follows. First, we show
that the distribution of the total common vote, $\sum \kappa_i X_i$, is close to a normal
distribution with mean 0 and variance $K$ for suitable choices of $a$ and $K$:
in particular, for $n$ sufficiently large, the probability that $\sum \kappa_i X_i > x\sqrt{K}$
will be at least a constant for any fixed $x$. Next, we complete the proof by
showing that if the total common vote is far from the origin the chances
that any processor will read a total vote whose sign differs from the common
vote is small. This fact is itself shown in two steps. First, it is shown
that, for suitable choice of $c$, the total of the extra votes for a processor $P$,
$\sum (\zeta_{P,i} - \kappa_i) X_i$, will be small with high probability. Second, a bound $\Delta$ is
derived on the difference between $\sum \zeta_{P,i} X_i$ and the total vote actually read
by $P$.

It will be necessary to select values for a, K, and c that give the correct
bounds  on  the  probabilities.  However,  we  will  be  in  a  better  position  to 
justify  our  choice  for  these  parameters  after  we  have  developed  more  of  the 
analysis,  so  the  choice  of  parameters  will  be  deferred  until  Section  5.4.5. 

5.4.1  Phases  of  the  protocol 

We begin by defining the phases of the protocol more carefully. Let $t_i$ be
the value of the i-th processor's internal variable $t$ at any given step of the
protocol. Let $U_i$ be the random variable representing the maximum value of
$t_i$ during the entire execution of the protocol. Let $T_i$ be the random variable
representing the maximum value of $t_i$ during the part of the execution of the
protocol where $\kappa_i = 1$.

In the proof of correctness we will encounter many quantities of the form
$\sum_{i=1}^{n} \chi(T_i)$ or $\sum_{i=1}^{n} \chi(U_i)$ for various functions $\chi$. We will want to get bounds
on these quantities without having to look too closely at the particular values
of each $T_i$ or $U_i$. This section proves several very general inequalities about
quantities of this form, all of which are ultimately based on the following
constraint:

$$\sum_{i=1}^{n} \frac{T_i^{2a+1}}{2a+1} \le \sum_{i=1}^{n} \sum_{j=1}^{T_i} j^{2a} \le K. \tag{5.5}$$

The constant $2a + 1$ will reappear often; for convenience we will write it as
$A$. As noted above, $a \ge 0$, and hence $A \ge 1$.

Define $T_K = \left(\frac{AK}{n}\right)^{1/A}$, so that $K = n T_K^A / A$. The constant $T_K$ represents
the maximum value of each $T_i$ if they are set to be equal while satisfying
inequality (5.5). Note that $T_K$ need not be integral. Now we can show:
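As a quick check of this definition (a worked substitution added for illustration, assuming the constraint is read as $\sum_i T_i^A / A \le K$), setting every $T_i$ equal to $T_K$ makes the left-hand side exactly $K$:

```latex
\sum_{i=1}^{n} \frac{T_K^{A}}{A}
  = \frac{n}{A} \left( \left( \frac{AK}{n} \right)^{1/A} \right)^{A}
  = \frac{n}{A} \cdot \frac{AK}{n}
  = K .
```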

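Lemma 5.6 Let $\psi(x) = x^A/A$ and let $\chi$ be any strictly increasing function
such that $\chi\psi^{-1}$ is concave. Then for any non-negative $\{x_i\}$, if $\sum_{i=1}^{n} \psi(x_i) \le K$,
then $\sum_{i=1}^{n} \chi(x_i) \le n\chi(T_K)$.

Proof: Since $\chi\psi^{-1}$ is concave, we have

$$\chi^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} \chi(x_i) \right) \le \psi^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) \right)$$

[HLP52, Theorem 92]. Simple algebraic manipulation yields

$$\sum_{i=1}^{n} \chi(x_i) \le n\chi\left( \psi^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) \right) \right) \le n\chi\left( \psi^{-1}\left( \frac{K}{n} \right) \right).$$

Since $\psi^{-1}(K/n) = (AK/n)^{1/A} = T_K$, we have $\sum \chi(x_i) \le n\chi(T_K)$. ∎

Letting $\chi$ be the identity function we have $\chi\psi^{-1}(x) = (Ax)^{1/A}$, which is
concave for $A \ge 1$. Hence:

Corollary 5.7

$$\sum_{i=1}^{n} T_i \le n T_K.$$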


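In the case where $\chi\psi^{-1}$ is convex, the following lemma applies instead:

Lemma 5.8 Let $\psi(x) = x^A/A$ and let $\chi$ be any strictly increasing function
such that $\chi\psi^{-1}$ is convex. Then for any non-negative $\{x_i\}$, if $\sum_{i=1}^{n} x_i^A/A \le K$,
then $\sum_{i=1}^{n} \chi(x_i) \le (n - 1)\chi(0) + \chi(n^{1/A} T_K)$.

Proof: Let $Y = \sum_{i=1}^{n} \psi(x_i)$. Now $\chi(x_i) = \chi\psi^{-1}(\psi(x_i))$, or

$$\chi\psi^{-1}\left( \left(1 - \frac{\psi(x_i)}{Y}\right) \cdot 0 + \frac{\psi(x_i)}{Y} \cdot Y \right),$$

which is at most

$$\left(1 - \frac{\psi(x_i)}{Y}\right) \chi\psi^{-1}(0) + \frac{\psi(x_i)}{Y} \chi\psi^{-1}(Y)$$

given the convexity of $\chi\psi^{-1}$. Hence

$$\begin{aligned}
\sum_{i=1}^{n} \chi(x_i) &\le n\chi\psi^{-1}(0) - \left( \frac{\sum \psi(x_i)}{Y} \right) \chi\psi^{-1}(0) + \left( \frac{\sum \psi(x_i)}{Y} \right) \chi\psi^{-1}(Y) \\
&= (n - 1)\chi\psi^{-1}(0) + \chi\psi^{-1}\left( \sum_{i=1}^{n} \psi(x_i) \right) \\
&\le (n - 1)\chi\psi^{-1}(0) + \chi\psi^{-1}(K),
\end{aligned}$$

which is just $(n - 1)\chi(0) + \chi(n^{1/A} T_K)$. ∎

The quantity $n^{1/A} T_K$ is the maximum value that any $x_i$ can take on
without violating the constraint on $\sum \psi(x_i)$. So what Lemma 5.8 says is that
if $\chi\psi^{-1}$ is convex, $\sum \chi(x_i)$ is maximized by maximizing one of the $x_i$ while
setting the rest to zero.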
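The two extremal configurations behind Lemmas 5.6 and 5.8 (an equal split versus everything concentrated in one variable) can be checked numerically. The following sketch is illustrative only: the parameters $a = 1$, $n = 5$, $K = 100$ and the test functions $\chi(x) = x$ (concave case) and $\chi(x) = x^4$ (convex case, since $\chi\psi^{-1}(y) = (Ay)^{4/3}$) are arbitrary choices, and `random_feasible` is a hypothetical helper that rescales a random point onto the constraint surface.

```python
import random

A = 3.0      # A = 2a + 1 with a = 1 (illustrative)
N = 5
K = 100.0
T_K = (A * K / N) ** (1 / A)

def psi(x):
    return x ** A / A

def random_feasible(rng):
    """Random non-negative x_1..x_N rescaled so that sum(psi(x_i)) equals K."""
    raw = [rng.random() for _ in range(N)]
    scale = (K / sum(psi(x) for x in raw)) ** (1 / A)
    return [scale * x for x in raw]

rng = random.Random(0)
samples = [random_feasible(rng) for _ in range(1000)]
TOL = 1 + 1e-9   # slack for floating-point rounding

# Concave case (Corollary 5.7): chi(x) = x, bound n * T_K.
concave_ok = all(sum(xs) <= N * T_K * TOL for xs in samples)

# Convex case (Lemma 5.8): chi(x) = x**4, bound (n - 1) * chi(0) + chi(n**(1/A) * T_K).
convex_bound = (N ** (1 / A) * T_K) ** 4
convex_ok = all(sum(x ** 4 for x in xs) <= convex_bound * TOL for xs in samples)

# The two extremal points, both of which meet the constraint with equality.
equal_split = [T_K] * N
all_in_one = [N ** (1 / A) * T_K] + [0.0] * (N - 1)
```

The equal split attains the bound of Corollary 5.7, and the concentrated point attains the bound of Lemma 5.8; random feasible points fall below both.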

For the variables $U_i$ we can show:

Lemma 5.9 Let $\psi(x) = x^A/A$ and let $\chi$ be any strictly increasing function
such that $\chi(\psi^{-1}(x) + c + 1)$ is concave in $x$. Then for any non-negative $\{x_i\}$,
if $\sum_{i=1}^{n} \psi(x_i) \le K$, then

$$\sum_{i=1}^{n} \chi(U_i) \le n\chi(T_K + c + 1). \tag{5.7}$$

Proof: Let $W_i$ be the number of votes written to the registers by the i-th
processor during the part of the execution where the total of the register
variance fields is less than or equal to $K$. The set of variables $\{W_i\}$ satisfies
the inequality $\sum W_i^A/A \le K$ using the same argument as gives (5.5).
Furthermore $U_i \le W_i + 1 + c$, because
after the i-th processor's next vote the total variance in the registers must
exceed $K$ and it can cast at most $c$ more votes before noticing this fact.
Define $\chi'(x) = \chi(x + c + 1)$. Then $\chi(U_i) \le \chi(W_i + c + 1) = \chi'(W_i)$. But $W_i$, $\chi'$
satisfy the premises of Lemma 5.6 and thus $\sum_{i=1}^{n} \chi(U_i) \le \sum_{i=1}^{n} \chi'(W_i) \le
n\chi'(T_K) = n\chi(T_K + c + 1)$. ∎

Setting $\chi(x)$ to $x$ gives

Corollary 5.10

$$\sum_{i=1}^{n} U_i \le n(T_K + c + 1). \tag{5.8}$$

Proof: $\chi(\psi^{-1}(x) + c + 1) = (Ax)^{1/A} + c + 1$, which is concave since $A \ge 1$. ∎

Define $g = 1 + \frac{c+3}{T_K}$; then $gT_K = T_K + c + 3$ will be an upper bound for
$T_K + c + 1$ as well as a number of closely related constants involving $c$ that
will appear later.

5.4.2  Common  votes 

The  purpose  of  this  section  is  to  show  that  for  n  sufficiently  large,  the  total 
common  vote  is  far  from  the  origin  with  constant  probability.  We  do  so  by 
showing  that  under  the  right  conditions  the  total  common  vote  will  be  nearly 
normally  distributed. 

Let $S^{\kappa}_i = \sum_{j=1}^{i} \kappa_j X_j$. As pointed out above, $\{S^{\kappa}_i, \mathcal{F}_i\}$ is
a martingale. Let $N = \lceil nT_K \rceil$. It follows from Corollary 5.7 that $\kappa_i = 0$ for
$i > N$ and thus $S^{\kappa}_N = \lim_{i \to \infty} S^{\kappa}_i$ is the sum of all the common votes. The
distribution of $S^{\kappa}_N$ is characterized in the following lemma.

Lemma 5.11 If

$$\frac{6A^2}{n^{1/A} T_K} \le 1, \tag{5.9}$$

then for any $x$,

$$\left| \Pr[S^{\kappa}_N \le x\sqrt{K}] - \Phi(x) \right| \le C_1 \left( \frac{A^2}{n^{1/A} T_K} \right)^{1/5}, \tag{5.10}$$

where $C_1$ is an absolute constant.


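Proof: The proof uses Theorem 5.2, which requires that the martingale be
normalized so that the total conditional variance $V_N^2$ is close to 1. So let
$Y_i = \kappa_i X_i / \sqrt{K}$ and consider the martingale $\{\sum_{j=1}^{i} Y_j, \mathcal{F}_i\}$. To apply the theorem
we need to compute a bound on the value $L_N$. We will fix $\delta = 1$.

We begin by getting a bound on the first term $\sum E[|Y_i|^{2+2\delta}]$. We have

$$\sum_{i=1}^{N} E[|Y_i|^4] = E\left[ \sum_{i=1}^{N} Y_i^4 \right] = \frac{1}{K^2} E\left[ \sum_{i=1}^{N} \kappa_i X_i^4 \right] = \frac{1}{K^2} E\left[ \sum_{i=1}^{n} \sum_{j=1}^{T_i} j^{4a} \right]. \tag{5.11}$$

Now,

$$\sum_{j=1}^{T} j^{4a} \le \int_0^{T} x^{4a}\,dx + T^{4a} = \frac{T^{4a+1}}{4a+1} + T^{4a}.$$

Define $\psi(x) = x^A/A$, $\chi(x) = \frac{x^{4a+1}}{4a+1} + x^{4a}$, taking $0^0 = 1$. Then
$\chi\psi^{-1}(y) = \frac{(Ay)^{(4a+1)/A}}{4a+1} + (Ay)^{4a/A}$ is convex, and hence
$\sum_{i=1}^{n} \left( \frac{T_i^{4a+1}}{4a+1} + T_i^{4a} \right)$ is at most
$(n^{1/A} T_K)^{4a} + \frac{(n^{1/A} T_K)^{4a+1}}{4a+1} + (n - 1)\chi(0)$ using Lemma 5.8. If $a$ is positive then
$\chi(0)$ is zero; however if $a$ is zero $\chi(0)$ will be 1. In either case $(n - 1)\chi(0) \le n - 1$.
Plugging everything back into (5.11) gives

$$\sum_{i=1}^{N} E[|Y_i|^4] \le \frac{(n^{1/A} T_K)^{4a}}{K^2} + \frac{(n^{1/A} T_K)^{4a+1}}{K^2 (4a+1)} + \frac{n - 1}{K^2}. \tag{5.12}$$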


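For the second term $E[|V_N^2 - 1|^{1+\delta}]$, observe that

$$V_N^2 = \sum_{i=1}^{N} E[Y_i^2 \mid \mathcal{F}_{i-1}] = \frac{1}{K} \sum_{i=1}^{N} \kappa_i X_i^2,$$

which is just $1/K$ times the sum of the squares of the weights $|X_i|$ of the
common votes. But the total variance of the common votes can differ from
$K$ by at most the variance of the first vote $X_i$ for which $\kappa_i = 0$. Since the
processor that casts this vote can have cast at most $n^{1/A} T_K$ votes beforehand,
the variance of this vote is at most $(n^{1/A} T_K + 1)^{2a}$, giving the bound

$$E[|V_N^2 - 1|^2] \le \left( \frac{(n^{1/A} T_K + 1)^{2a}}{K} \right)^2. \tag{5.13}$$

Combining (5.12) and (5.13) gives

$$\begin{aligned}
L_N &\le \frac{(n^{1/A} T_K)^{4a}}{K^2} + \frac{(n^{1/A} T_K)^{4a+1}}{K^2 (4a+1)} + \frac{n - 1}{K^2} + \left( \frac{(n^{1/A} T_K + 1)^{2a}}{K} \right)^2 \\
&= \frac{n^{4a/A} T_K^{4a}}{K^2} + \frac{n^{(4a+1)/A} T_K^{4a+1}}{K^2 (4a+1)} + \frac{A^2 (n - 1)}{n^2 T_K^{2A}} + \left( \frac{n^{2a/A} T_K^{2a} (1 + n^{-1/A} T_K^{-1})^{2a}}{K} \right)^2 \\
&\le A^2 n^{-2/A} T_K^{-2} + A^2 n^{-1/A} T_K^{-1} + A^2 n^{-1} T_K^{-2A} + A^2 n^{-2/A} T_K^{-2}\, e^{4a/(n^{1/A} T_K)} \\
&\le \frac{6A^2}{n^{1/A} T_K}.
\end{aligned}$$

The second-to-last step uses the approximation $(1 + x)^b \le e^{bx}$ for non-negative
$b$ and $x$. The exponential term is serendipitously bounded by $e$ if (5.9) holds,
since $6A^2 (n^{1/A} T_K)^{-1} \le 1$ implies that $4a (n^{1/A} T_K)^{-1}$ is also at most 1.

A more direct application of (5.9) shows that $L_N \le 1$, and thus Theorem
5.2 applies. Hence

$$\begin{aligned}
\left| \Pr\left[ \sum_{i=1}^{N} \kappa_i X_i \le x\sqrt{K} \right] - \Phi(x) \right|
&= \left| \Pr\left[ \sum_{i=1}^{N} Y_i \le x \right] - \Phi(x) \right| \\
&\le C \left( \frac{6A^2}{n^{1/A} T_K} \right)^{1/5} \left( \frac{1}{1 + |x|^{16/5}} \right) \\
&\le C_1 \left( \frac{A^2}{n^{1/A} T_K} \right)^{1/5}.
\end{aligned}$$

∎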


5.4.3  Extra  votes 

In this section we examine the extra votes from the point of view of a
particular processor $P$.

Recall that $\zeta_{P,i}$ is defined to be 1 if the vote $X_i$ is cast by some processor $Q$
before $P$'s final read of $Q$'s register and 0 otherwise. Clearly, $\zeta_{P,i} \ge \kappa_i$ since $P$
could not have started its final read until the total variance exceeded $K$. As
discussed above, both $\kappa_i$ and $\zeta_{P,i}$ are $\mathcal{F}_{i-1}$-measurable. Thus $\xi_i = \zeta_{P,i} - \kappa_i$ is
a 0–1 random variable that is $\mathcal{F}_{i-1}$-measurable, and $\{S^{\xi}_i = \sum_{j=1}^{i} \xi_j X_j\}$
is a martingale by Theorem 5.1.

Define $\Delta = n(gT_K)^a$. The following lemma shows a bound on the tails of
$\sum \xi_i X_i$.


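Lemma 5.12 For any $x > 0$, if

$$g^a \le d \sqrt{\frac{T_K}{nA}} \tag{5.14}$$

holds for some positive $d < x$, and

$$g^A \le 1 + \frac{(x - d)^2}{2 \log(n/p)} \tag{5.15}$$

holds for some positive $p < n$, then for each processor $P$,

$$\Pr\left[ \sum (\zeta_{P,i} - \kappa_i) X_i \le \Delta - x\sqrt{K} \right] \le p/n. \tag{5.16}$$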

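Proof: The proof uses Corollary 5.5, so we proceed by showing that its
premises (stated in Theorem 5.4) are satisfied.

By Corollary 5.10, $X_i$ and thus $\xi_i X_i$ is zero for $i > n(T_K + c + 1)$. So
$\sum \xi_i X_i = S^{\xi}_M$ where $M = n(T_K + c + 1)$.

Set $w_i = |\xi_i X_i|$. Then the first premise of Corollary 5.5 follows from the
fact that for each $i$, $\xi_i$ and $|X_i|$ are both $\mathcal{F}_{i-1}$-measurable. The second premise
is immediate. For the third premise, notice that

$$\sum w_i^2 = \sum \xi_i X_i^2 = \sum \zeta_{P,i} X_i^2 - \sum \kappa_i X_i^2 \le \sum X_i^2 - \sum \kappa_i X_i^2.$$

The first term is

$$\sum X_i^2 = \sum_{i=1}^{n} \sum_{j=1}^{U_i} j^{2a}.$$

The second term is

$$\sum \kappa_i X_i^2 \ge K - t^{2a}$$

for some $t$ which is at most $U_i$ for some $i$. Thus

$$\begin{aligned}
\sum w_i^2 &\le -K + t^{2a} + \sum_{i=1}^{n} \sum_{j=1}^{U_i} j^{2a} \\
&\le -K + \sum_{i=1}^{n} \sum_{j=1}^{U_i + 1} j^{2a} \\
&\le -K + \sum_{i=1}^{n} (U_i + 2)^A / A.
\end{aligned} \tag{5.17}$$

Let $\chi(x) = (x + 2)^A / A$. Then

$$\chi(\psi^{-1}(y) + c + 1) = \frac{\left( (Ay)^{1/A} + c + 3 \right)^A}{A} = \frac{1}{A} \sum_{k=0}^{A} \binom{A}{k} (Ay)^{k/A} (c + 3)^{A-k}.$$

For $y \ge 0$, the second derivative of each term is either 0 (when $k = A$) or
negative; thus $\chi(\psi^{-1}(y) + c + 1)$ is concave and Lemma 5.9 gives

$$\sum_{i=1}^{n} \frac{(U_i + 2)^A}{A} \le n\chi(T_K + c + 1) = \frac{n (T_K + c + 3)^A}{A} = \frac{n (gT_K)^A}{A} = g^A K. \tag{5.18}$$

It follows from (5.17) and (5.18) that

$$\sum w_i^2 \le g^A K - K = K(g^A - 1).$$

Applying (5.4) from Corollary 5.5 now yields, for all $\lambda > 0$,

$$\Pr[S^{\xi}_M \le -\lambda] \le e^{-\lambda^2 / (2K(g^A - 1))}.$$

If (5.14) holds, then $\Delta \le d\sqrt{K}$ by Lemma 5.13. So

$$\begin{aligned}
\Pr\left[ \sum \xi_i X_i \le \Delta - x\sqrt{K} \right] &\le \Pr\left[ S^{\xi}_M \le -(x - d)\sqrt{K} \right] \\
&\le e^{-(x-d)^2 K / 2K(g^A - 1)} \\
&= e^{-(x-d)^2 / 2(g^A - 1)}.
\end{aligned} \tag{5.19}$$

But if (5.15) holds then

$$g^A - 1 \le \frac{(x - d)^2}{2 \log(n/p)}$$

and, since $\log(n/p) > 0$ and $g > 1$,

$$-\frac{(x - d)^2}{2(g^A - 1)} \le -\log(n/p) = \log(p/n).$$

From which it follows that

$$e^{-(x-d)^2 / 2(g^A - 1)} \le e^{\log(p/n)} = p/n.$$

∎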

5.4.4  Written  votes  vs.  decided  votes 

In this section we show that the difference between $\sum \zeta_{P,i} X_i$ and the total
vote actually read by $P$ is bounded by $\Delta = n(gT_K)^a$.

Lemma 5.13 Let $R_P$ be the sum of the votes read during $P$'s final read.
Then

$$\left| \sum \zeta_{P,i} X_i - R_P \right| \le n(T_K + c + 1)^a \le n(gT_K)^a = \Delta. \tag{5.20}$$

Proof: Suppose $\zeta_{P,i} = 1$, and suppose $X_i$ is decided by processor $P_j$. If the
vote $X_i$ is not included in the value read by $P$, it must have been decided
before $P$'s read of $P_j$'s register but written afterwards. Because each vote
is written out before the next vote is decided there can be at most one vote
from $P_j$ which is included in $\sum \zeta_{P,i} X_i$ but is not actually read by $P$. This
vote has weight at most $U_j^a$. So we have $\left| \sum \zeta_{P,i} X_i - R_P \right| \le \sum_{j=1}^{n} U_j^a$. Now let
$\chi(x) = x^a$. Then

$$\chi(\psi^{-1}(y) + c + 1) = \left( (Ay)^{1/A} + c + 1 \right)^a = \sum_{k=0}^{a} \binom{a}{k} (Ay)^{k/A} (c + 1)^{a-k},$$

which is concave since the second derivative of each term of the sum is
negative. The rest follows from Lemma 5.9. ∎
5.4.5 Choice of parameters

Let  us  summarize  the  proof  of  correctness  in  a  single  theorem: 


Theorem 5.14 Define

$$A = 2a + 1, \qquad T_K = \left( \frac{AK}{n} \right)^{1/A}, \qquad g = 1 + \frac{c + 3}{T_K},$$

and suppose there exist $d > 0$, $x > d$ and positive $p < n$ such that all of the
following hold:

$$g^a \le d \sqrt{\frac{T_K}{nA}} \tag{5.21}$$

$$g^A \le 1 + \frac{(x - d)^2}{2 \log(n/p)} \tag{5.22}$$

$$\frac{6A^2}{n^{1/A} T_K} \le 1 \tag{5.23}$$

Then the protocol implements a shared coin with agreement parameter at least

$$1 - \left[ \Phi(x) + C_1 \left( \frac{A^2}{n^{1/A} T_K} \right)^{1/5} + p \right], \tag{5.24}$$

where $C_1$ is the constant from Lemma 5.11.


Proof: To show that the agreement parameter is at least (5.24) we must
show that for each $z \in \{0, 1\}$ the probability that all processors decide $z$ is
at least (5.24). Without loss of generality let us consider only the probability
that all processors decide 1; the case of all processors deciding 0 follows by
symmetry.

Recall the definition $\Delta = n(gT_K)^a$. Suppose that $\sum \kappa_i X_i > x\sqrt{K}$, and
that for each processor $P$, $\sum (\zeta_{P,i} - \kappa_i) X_i > \Delta - x\sqrt{K}$. Then for each $P$ we
have $\sum \zeta_{P,i} X_i > \Delta$ and by Lemma 5.13 $P$ reads a value greater than 0 during
its final read and thus decides 1.

Now for this event not to occur, we must either have $\sum \kappa_i X_i \le x\sqrt{K}$
or $\sum (\zeta_{P,i} - \kappa_i) X_i \le \Delta - x\sqrt{K}$ for some $P$. But as the probability of a
union of events never exceeds the sum of the probabilities of the events, the
probability of failing in any of these ways is at most

$$\Pr\left[ \sum \kappa_i X_i \le x\sqrt{K} \right] + \sum_{P} \Pr\left[ \sum (\zeta_{P,i} - \kappa_i) X_i \le \Delta - x\sqrt{K} \right]
\le \left[ \Phi(x) + C_1 \left( \frac{A^2}{n^{1/A} T_K} \right)^{1/5} \right] + n(p/n) \tag{5.25}$$

by Lemmas 5.11 and 5.12. So the probability some processor decides 0 is at
most (5.25), and thus the probability that all processors decide 1 is at least
1 minus (5.25). ∎

The  running  time  of  the  protocol  is  more  easily  shown: 

Theorem  5.15  No  processor  executes  more  than  (AK)1^A(2  +  n/c)  +  2c  4-2 n 
register  operations  during  an  execution  of  the  shared  coin  protocol. 


Proof: First consider the maximum number of votes a processor can cast.
After $(AK)^{1/A}$ votes the total variance of the processor's votes will be

\[ \sum_{x=1}^{(AK)^{1/A}} x^{2a} \ge \int_0^{(AK)^{1/A}} x^{2a}\,dx = \frac{\bigl((AK)^{1/A}\bigr)^{A}}{A} = K, \]

so after at most an additional c votes the processor will execute line 11 of
Figure 5.1 and see a total variance greater than K. Thus each processor
casts at most $(AK)^{1/A} + c$ votes. But each vote costs 1 write operation in
line 8, and every c votes costs n reads in line 11, to which must be added
a one-time cost of n reads in line 13. The total number of operations is
thus at most $\bigl((AK)^{1/A} + c\bigr)(1 + \lceil n/c \rceil) + n \le \bigl((AK)^{1/A} + c\bigr)(2 + n/c) + n =
(AK)^{1/A}(2 + n/c) + 2c + 2n$. ∎
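The counting argument above can be checked on small cases. The following sketch assumes A = 2a + 1 (the value that makes the integral in the proof equal K) and that the x-th vote of a processor has variance $x^{2a}$; the function names are illustrative and not part of the protocol.

```python
import math

def variance_after(a: int, votes: int) -> int:
    """Total variance of the first `votes` votes, assuming vote x has variance x^(2a)."""
    return sum(x ** (2 * a) for x in range(1, votes + 1))

def vote_bound(a: int, K: float, c: int) -> float:
    """(AK)^(1/A) + c, the maximum number of votes cast by one processor."""
    A = 2 * a + 1
    return (A * K) ** (1.0 / A) + c

def counted_operations(a: int, K: float, c: int, n: int) -> float:
    """((AK)^(1/A) + c)(1 + ceil(n/c)) + n: writes, periodic reads, final read."""
    return vote_bound(a, K, c) * (1 + math.ceil(n / c)) + n

def operation_bound(a: int, K: float, c: int, n: int) -> float:
    """(AK)^(1/A) (2 + n/c) + 2c + 2n, the bound of Theorem 5.15."""
    A = 2 * a + 1
    return (A * K) ** (1.0 / A) * (2 + n / c) + 2 * c + 2 * n

if __name__ == "__main__":
    a, K, c, n = 0, 16, 4, 8
    print(variance_after(a, 16), counted_operations(a, K, c, n), operation_bound(a, K, c, n))
```

For a = 0, K = 16, c = 4, n = 8 the counted total is 68 and the stated bound is 88, so the final simplification in the proof is not tight but is easier to work with.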

It remains only to find values for a, K, and c which give both a constant
agreement parameter and a reasonable running time. As a warm-up, let us
consider what happens if we emulate the protocol of Bracha and Rachman
[BR91]:


Theorem 5.16 If a = 0, $K = 4n^2$, and $c = n/(4\log n) - 3$, then for n sufficiently
large the protocol implements a shared coin with agreement parameter at least
0.05 in which each processor executes at most $O(n^2 \log n)$ operations.

Proof: For the agreement parameter, we have A = 1, $T_K = 4n$, and
$g = 1 + 1/(16\log n)$. Let d = 1/2, x = 1, and p = 1/10. Then (5.21) holds since
$g^a = 1 \le d\sqrt{T_K/(nA)} = 1$. Furthermore,

\[ 1 + \frac{(x-d)^2}{2\log(n/p)} = 1 + \frac{1}{8(\log n - \log p)} \ge 1 + \frac{1}{16\log n} \]

when $n \ge 1/p$. Thus (5.22) holds. The remaining inequality (5.23) holds
for n > 2, so by Theorem 5.14 we have a probability of failure of at most

\[ \Phi(1) + C_1\left(\frac{1}{4n^2}\right)^{1/5} + \frac{1}{10} \le 0.842 + O\bigl(n^{-2/5}\bigr) + 0.1 \]

which is not more than $0.942 + \epsilon$ for n sufficiently large. In particular for
n greater than some $n_0$ this quantity is at most 0.95, and the agreement
parameter is thus at least 1 - 0.95.

The running time is immediate from Theorem 5.15. ∎
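The two premises verified in this proof are easy to test numerically. The sketch below assumes base-2 logarithms and A = 2a + 1, and uses the values $T_K = 4n$ and $g = 1 + 1/(16\log n)$ quoted in the proof; premise (5.23) is not checked, and the helper names are illustrative.

```python
import math

def premises_hold(a, T_K, g, n, d=0.5, x=1.0, p=0.1, log=math.log2):
    """Return (ok_521, ok_522) for the premises (5.21) and (5.22).

    (5.21): g^a <= d * sqrt(T_K / (n A))
    (5.22): g^A <= 1 + (x - d)^2 / (2 log(n / p))
    """
    A = 2 * a + 1  # assumed relation between a and A
    ok_521 = g ** a <= d * math.sqrt(T_K / (n * A))
    ok_522 = g ** A <= 1 + (x - d) ** 2 / (2 * log(n / p))
    return ok_521, ok_522

def theorem_5_16_parameters(n, log=math.log2):
    """Parameter values quoted in the proof of Theorem 5.16 (a = 0)."""
    return dict(a=0, T_K=4 * n, g=1 + 1 / (16 * log(n)))

if __name__ == "__main__":
    for n in (4, 16, 1024):
        print(n, premises_hold(n=n, **theorem_5_16_parameters(n)))
```

With p = 1/10 the second premise fails at n = 4 and holds at n = 16, matching the requirement $n \ge 1/p$ in the proof.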

Now  consider  what  happens  if  a  is  not  restricted  to  be  a  constant  0. 

Theorem 5.17 If $a = (\log n - 1)/2$, $K = (16n\log n)^{\log n}(n/\log n)$, and
$c = n/\log n - 3$, then for n sufficiently large the protocol implements a shared coin
with constant agreement parameter in which each processor executes at most
$O(n \log^2 n)$ operations.


Proof: We have $A = \log n$, $T_K = 16n\log n$, and $g = 1 + 1/(16\log^2 n)$. Let d = 1/2,
x = 1, and p = 1/10.

We want to apply Theorem 5.14, so first we verify that its premises are
satisfied. To show (5.21), compute

\[ g^a = \left(1 + \frac{1}{16\log^2 n}\right)^{(\log n - 1)/2} < e^{1/(32\log n)} \]

which for n > 2 will be less than $d\sqrt{T_K/(nA)} = 2$. To show (5.22), note that

\[ g^A = \left(1 + \frac{1}{16\log^2 n}\right)^{\log n} < e^{1/(16\log n)} \]

and thus $\log(g^A) < 1/(16\log n)$. But

\[ \log\left(1 + \frac{(x-d)^2}{2\log(n/p)}\right) = \log\left(1 + \frac{1}{8\log(n/p)}\right) > \frac{1}{8\log(n/p)} - \frac{1}{128\log^2(n/p)} = \frac{1}{8(\log n - \log p)} - \frac{1}{128(\log n - \log p)^2} \]

(using the approximation $\log(1 + x) > x - x^2/2$). For sufficiently large n
this quantity exceeds $1/(16\log n)$ and (5.22) holds. The remaining constraint
(5.23) is easily verified, and thus Theorem 5.14 applies and the agreement
parameter is at least

\[ 1 - \left(\Phi(1) + C_1\left(\frac{\log^2 n}{n^{1/\log n}(16n\log n)}\right)^{1/5} + \frac{1}{10}\right) \ge 1 - \left(0.842 + O\bigl((\log n/n)^{1/5}\bigr) + 0.10\right) \]

which is at least 0.05 for sufficiently large n. Thus the protocol gives a
constant agreement parameter.

Now by Theorem 5.15, the number of operations executed by any single
processor is at most $(AK)^{1/A}(2 + n/c) + 2c + 2n$, or

\[ (\log n)^{1/\log n}\,(16n\log n)\,(n/\log n)^{1/\log n}\,O(\log n) + O(n) \]

which is $O(n \log^2 n)$. ∎
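To see that the bound of Theorem 5.15 really grows like $n \log^2 n$ under these parameters, it can be evaluated for increasing n. The sketch below assumes base-2 logarithms and computes $(AK)^{1/A}$ through logarithms, since K itself is far too large to represent directly; the function names are illustrative.

```python
import math

def per_processor_bound(n: int) -> float:
    """(AK)^(1/A) (2 + n/c) + 2c + 2n with the parameters of Theorem 5.17."""
    L = math.log2(n)          # log n, taken base 2 as an assumption
    A = L
    # log(AK) for K = (16 n log n)^(log n) * (n / log n)
    log_AK = math.log(A) + L * math.log(16 * n * L) + math.log(n / L)
    votes = math.exp(log_AK / A)   # (AK)^(1/A)
    c = n / L - 3
    return votes * (2 + n / c) + 2 * c + 2 * n

def ratio_to_n_log2_n(n: int) -> float:
    """Bound divided by n log^2 n; stays bounded if the bound is O(n log^2 n)."""
    return per_processor_bound(n) / (n * math.log2(n) ** 2)

if __name__ == "__main__":
    for k in (10, 20, 40):
        print(k, ratio_to_n_log2_n(2 ** k))
```

Under these assumptions the ratio decreases slowly toward 32 as n grows, because $(\log n)^{1/\log n}(n/\log n)^{1/\log n} = n^{1/\log n} = 2$ for base-2 logarithms.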

It follows immediately that plugging a coin with the parameters of Theorem
5.17 into the consensus protocol construction of Chapter 3 gives a
consensus protocol that requires an expected $O(n \log^2 n)$ operations per processor.
It is not difficult to see that the best bound we can place on the total
number of operations is in fact n times this quantity, or $O(n^2 \log^2 n)$. The
worst case is when each processor casts the same number of common votes.


Chapter  6 

Conclusions and Open Problems


In  this  thesis  I  have  shown: 

•  A simple algorithm for a robust wait-free shared coin with bias at most
$\epsilon$ which runs in an expected $O(n^4/\epsilon^2)$ total register operations.

•  A modification of this algorithm that achieves consensus in an expected
$O(n^4)$ total register operations, and which can be implemented using
only three atomic counters.

•  The asymptotically fastest known wait-free consensus protocol in the
per-processor measure, based on a shared coin that requires only an
expected $O(n \log^2 n)$ register operations per processor to achieve a constant
agreement parameter.

This chapter discusses how these results fit into the history of wait-free
consensus, and what difficulties need to be overcome to make further
improvements. It concludes with a list of open problems.


6.1  Comparison  with  other  protocols 

Table 6.1 gives a comparison of the running times of wait-free consensus
protocols for the shared-memory model. In this table the quantity p is the
number of active processors as defined in Section 4.4.

                                      Expected operations
Protocol                              Per processor          Total
Abrahamson [Abr88]                    $2^{O(n^2)}$           $2^{O(n^2)}$
Aspnes and Herlihy [AH90a]            $O(n^4)$               $O(n^4)$
Attiya, Dolev, and Shavit [ADS89]     $O(n^4)$               $O(n^4)$
Chapter 4 ([Asp90])                   $O(n^2(p^2 + n))$      $O(n^2(p^2 + n))$
Bracha and Rachman [BR90]             $O(n(p^2 + n))$        $O(n(p^2 + n))$
Dwork et al. [DHPW92]                 $O(n(p^2 + n))$        $O(n(p^2 + n))$
Saks, Shavit, and Woll [SSW91]        $O(n^3)$               $O(n^3)$
Bracha and Rachman [BR91]             $O(n^2 \log n)$        $O(n^2 \log n)$
Chapter 5 ([AW92])                    $O(n \log^2 n)$        $O(n^2 \log^2 n)$

Table 6.1: Comparison of consensus protocols.

The first known protocol was the exponential protocol of Abrahamson [Abr88].
The first known polynomial-time protocol was that of Aspnes and Herlihy [AH90a].
Attiya, Dolev, and Shavit [ADS89] described a modification of this protocol which
required only a bounded amount of space, but which retained the spirit of
the rounds-based structure of the Aspnes-Herlihy protocol.

The protocol of Chapter 4, which also appears in [Asp90], was the first to
eliminate the use of rounds by using a robust shared coin. Since its first
appearance its performance was improved by a factor of n by Bracha and
Rachman [BR90] and by Dwork et al. [DHPW92]. Both groups achieved the
improvement by replacing the $O(n^2)$ implementation of an atomic counter with
a weaker primitive that required only $O(n)$ register operations per counter
operation, and acted sufficiently like a counter to make the consensus
protocol work.

The first protocol to use the idea of casting votes until a quorum is reached
(instead of until a sufficiently large margin of victory is reached) was that
of Saks, Shavit, and Woll [SSW91]. Their protocol was optimized for the
special case where nearly all of the processors are running in lockstep. Bracha
and Rachman [BR91] noticed that the protocol could be sped up by having
each processor read all the registers only after every $O(n/\log n)$ votes; the
resulting protocol is a special case of the protocol of Chapter 5 obtained by
setting a to 0. The protocol of Chapter 5, which also appears in [AW92], is
the first to use votes of unequal weight, and as a result is the first for which
the maximum expected number of operations executed by a single processor
is more than a constant factor less than the maximum executed by all of the
processors together.

6.2  Limits  to  wait-free  consensus 

The table shows a considerable evolution of wait-free consensus protocols
since Abrahamson's exponential solution. It is natural to ask how much
better consensus protocols can still get.

One limitation we quickly run into is the following. If a processor is
running by itself, it must read every other processor's register at least once. If
not, it cannot distinguish the situation where it really ran first all by itself
from the situation where some other processor (whose register it has not read)
ran to completion before it started. In the latter case the processor would be
required by the consistency condition to agree with its unseen predecessor;
but without reading that predecessor's register it would have no way of
knowing which value to choose. Thus in any wait-free consensus protocol some
processor can always be forced to execute at least n - 1 read operations.
This $\Omega(n)$ lower bound is unaffected even if the adversary is substantially
weakened; the argument remains valid, for example, if the adversary is not
allowed to see the internal states of processors or even if it is required to
specify all of its scheduling decisions before the protocol starts. So in fact
the $O(n \log^2 n)$ protocol we have described here is close to the best we can
hope for in the per-processor measure, given the assumption of single-writer
registers, even against relatively weak adversaries.

On the other hand, the question of how far the total number of operations
can be reduced does not have as easy an answer. That some processor
can be forced to execute $\Omega(n)$ operations does not mean that all processors
can be forced to; it could be the case that if the processors cooperate they
could collectively gather information about the state of the system faster
than they would independently. In fact, the best known lower bound for
expected total operations is only $\Omega(n \log n)$, based on the minimum number
of operations needed to communicate every processor's state to every
other processor [SSW91]. Furthermore, the fact that the protocol of [BR91]
achieves a bound of $O(n^2 \log n)$ on total work shows that some improvement
is possible on our protocol, though possibly only at the expense of increasing
the per-processor bound.

However, to get below $\Omega(n^2)$ operations will require at least two
breakthroughs. The first problem is that all of the algorithms we currently have
require that every processor read every other processor's register directly at
some point, which takes $\Omega(n^2)$ total operations. It seems likely that some
sort of randomized cooperative technique could allow this dissemination of
information to proceed more quickly (possibly at the cost of using very large
registers); but at present no such technique is known.

The second problem is that to reduce the total number of operations below
$\Omega(n^2)$ it will be necessary to reduce the number of local random choices below
$\Omega(n^2)$, as local coin-flips that have no writes between them effectively
consolidate into a single random choice from the point of view of the scheduler.
This problem appears more difficult than the first, as it requires abandoning
the voting technique at the heart of all currently known wait-free consensus
protocols. The reason is that in these protocols, the scheduler's power
only becomes limited when the standard deviation of the total vote becomes
comparable to the sum of the votes that the scheduler can withhold. With
unweighted votes, $\Omega(n^2)$ votes are required; for weighted votes the situation
is only made worse, as increasing the weight of some votes increases the sum
of the withheld votes more quickly than it increases the standard deviation of
the total vote. It appears that it will be difficult to get below $\Omega(n^2)$ without
adopting some decision method that takes more account of the ordering of
events in the system.
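The unweighted case can be written out as a rough calculation. It assumes, as an illustration rather than a statement taken from the protocols themselves, that each vote is an independent fair $\pm 1$ and that the scheduler can withhold on the order of one pending vote per processor, so about n votes in total; N denotes the number of votes cast.

```latex
\[
  \underbrace{\sqrt{N}}_{\text{std.\ deviation of the total vote}}
  \;\gtrsim\;
  \underbrace{n}_{\text{sum of withheld votes}}
  \quad\Longrightarrow\quad
  N = \Omega(n^2).
\]
```

Under the same assumptions, giving some votes a weight w > 1 raises the right-hand side in proportion to w for every processor holding such a vote, while each such vote adds only $w^2$ to the variance under the square root.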


6.3  Open  problems 

The consensus protocol described in Chapter 5 comes quite close to the limits
of current methods for solving wait-free consensus. Aside from optimizations
such as eliminating the $\log n$ factors from the per-processor bound or reducing
the value of n at which the protocol becomes practical, essentially the only
question remaining is whether the total number of operations can be reduced
substantially. There are several questions whose answers would shed light on
this problem, as well as many other problems in the area:

1.  Is it possible in the asynchronous shared-memory model for n processors
to collectively read n registers in fewer than $O(n^2)$ total operations?


2.  Does  every  consensus  protocol  contain  a  shared  coin? 

3.  Can a shared coin with constant agreement parameter be built that
requires less than $\Omega(n^2)$ total operations? (A closely related question:
can a shared coin of arbitrarily small bias $\epsilon$ run in less than $\Omega(n^2/\epsilon^2)$
total operations?)